Hello Spark Developers,

While trying to use rdd.take(numItems), my job just hangs there forever. The following is the output:

14/03/07 00:52:21 INFO SparkContext: Starting job: take at xx.java:55
14/03/07 00:52:21 INFO DAGScheduler: Got job 1 (take at xx.java:55) with 1 output partitions (allowLocal=true)
14/03/07 00:52:21 INFO DAGScheduler: Final stage: Stage 1 (take at LimitRDDTransformOperation.java:55)
14/03/07 00:52:21 INFO DAGScheduler: Parents of final stage: List()
14/03/07 00:52:21 INFO DAGScheduler: Missing parents: List()
14/03/07 00:52:21 INFO DAGScheduler: Computing the requested partition locally
14/03/07 00:52:21 INFO HadoopRDD: Input split: hdfs://ec2:9000/event_0000.csv:0+134217728
14/03/07 00:52:22 INFO SparkContext: Job finished: take at xx.java:55, took 1.705299577 s
14/03/07 00:52:23 INFO SparkContext: Starting job: take at xx.java:55
14/03/07 00:52:23 INFO DAGScheduler: Got job 2 (take at xx.java:55) with 2 output partitions (allowLocal=true)
14/03/07 00:52:23 INFO DAGScheduler: Final stage: Stage 2 (take at xx.java:55)
14/03/07 00:52:23 INFO DAGScheduler: Parents of final stage: List()
14/03/07 00:52:23 INFO DAGScheduler: Missing parents: List()
14/03/07 00:52:23 INFO DAGScheduler: Submitting Stage 2 (MappedRDD[2] at textFile at yy.java:215), which has no missing parents
14/03/07 00:52:23 INFO DAGScheduler: Submitting 2 missing tasks from Stage 2 (MappedRDD[2] at textFile at yy.java:215)
14/03/07 00:52:23 INFO TaskSchedulerImpl: Adding task set 2.0 with 2 tasks
14/03/07 00:52:23 INFO TaskSetManager: Starting task 2.0:0 as TID 150 on executor 1: ip-172-31-10-192.us-west-1.compute.internal (NODE_LOCAL)
14/03/07 00:52:23 INFO TaskSetManager: Serialized task 2.0:0 as 16073 bytes in 0 ms
14/03/07 00:52:23 INFO TaskSetManager: Starting task 2.0:1 as TID 151 on executor 4: ip-172-31-10-193.us-west-1.compute.internal (NODE_LOCAL)
14/03/07 00:52:23 INFO TaskSetManager: Serialized task 2.0:1 as 16073 bytes in 0 ms

In the line

14/03/07 00:52:21 INFO HadoopRDD: Input split: hdfs://ec2:9000/event_0000.csv:0+134217728

134217728 is the block size HDFS currently uses.

First, I don't understand this line:

Got job 2 (take at xx.java:55) with 2 output partitions (allowLocal=true)

since rdd.take() was already submitted as job 1. Because numItems is more than what one block holds, I am guessing it is trying to fetch the second block, but the program hangs at this point. When I increase the HDFS block size so that it is larger than the input file event_0000.csv, the program works again.

Could anybody help me understand the cause?

Thanks a lot,
-chen
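P.S. For reference, here is a minimal sketch of what my code does. The HDFS path matches the log above, but the class name and the numItems value are stand-ins; the real job has more logic around the take() call:

import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class TakeHangRepro {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("take-hang-repro");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // One partition per 128 MB HDFS block of the input file.
        JavaRDD<String> lines = sc.textFile("hdfs://ec2:9000/event_0000.csv");

        // take() first computes partition 0 in a local job (job 1 in the log);
        // when that partition yields fewer than numItems rows, it submits a
        // second job over more partitions (job 2), which is where it hangs.
        int numItems = 5000000; // stand-in; more rows than one block holds
        List<String> rows = lines.take(numItems);
        System.out.println("Fetched " + rows.size() + " rows");

        sc.stop();
    }
}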