Hi list,

I ran into an issue which I think could be a bug.

I have a Hive table stored as Parquet files; let's say it's called
testtable. The code below gets stuck forever in spark-shell when I use a
local master (driver and executor in a single process):
sqlContext.sql("select * from testtable").rdd.cache.zipWithIndex().count

But it works if I use a standalone master.

I also tried several variants:

Don't cache the RDD (works):
sqlContext.sql("select * from testtable").rdd.zipWithIndex().count

Cache the RDD after zipWithIndex (works):
sqlContext.sql("select * from testtable").rdd.zipWithIndex().cache.count

Use the Parquet file reader directly (doesn't work):
sqlContext.read.parquet("hdfs://localhost:8020/user/hive/warehouse/testtable").rdd.cache.zipWithIndex().count

Use Parquet files on the local file system (works):
sqlContext.read.parquet("/tmp/testtable").rdd.cache.zipWithIndex().count

I read the code of zipWithIndex() and looked at the DAG visualization. I
think the function makes Spark first retrieve and cache n-1 partitions of
the target table, and only then the last partition. Something must go wrong
when the driver/executor tries to read that last partition from HDFS.
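
For reference, the gist of what ZippedWithIndexRDD does when it is built
(paraphrased from the source, not the exact code) is to run a separate job
over every partition except the last to count records and turn the counts
into start offsets, which is why those n-1 partitions get materialized (and,
with cache, cached) before the last partition is ever touched:

import org.apache.spark.rdd.RDD

// Rough paraphrase of how zipWithIndex derives per-partition start offsets:
// count the records of partitions 0..n-2 in a separate job, then scan the
// counts into cumulative offsets. The last partition's size is never needed.
def startIndices[T](prev: RDD[T]): Array[Long] = {
  val n = prev.partitions.length
  if (n == 0) Array.empty[Long]
  else if (n == 1) Array(0L)
  else prev.context.runJob(
    prev,
    (iter: Iterator[T]) => iter.size.toLong,
    0 until n - 1
  ).scanLeft(0L)(_ + _)
}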

I am using spark-1.5.2-bin-hadoop2.6 on the Cloudera QuickStart VM 5.4.2.

-- 
Kai Wei
Big Data Developer

Pythian - love your data

w...@pythian.com
Tel: +1 613 565 8696 x1579
Mobile: +61 403 572 456
