Hello Spark Developers,

While trying to use rdd.take(numItems), my job just hangs there forever. The following is the output:

14/03/07 00:52:21 INFO SparkContext: Starting job: take at xx.java:55
14/03/07 00:52:21 INFO DAGScheduler: Got job 1 (take at xx.java:55) with 1 output partitions (allowLocal=true)
14/03/07 00:52:21 INFO DAGScheduler: Final stage: Stage 1 (take at LimitRDDTransformOperation.java:55)
14/03/07 00:52:21 INFO DAGScheduler: Parents of final stage: List()
14/03/07 00:52:21 INFO DAGScheduler: Missing parents: List()
14/03/07 00:52:21 INFO DAGScheduler: Computing the requested partition locally
14/03/07 00:52:21 INFO HadoopRDD: Input split: hdfs://ec2:9000/event_0000.csv:0+134217728
14/03/07 00:52:22 INFO SparkContext: Job finished: take at xx.java:55, took 1.705299577 s
14/03/07 00:52:23 INFO SparkContext: Starting job: take at xx.java:55
14/03/07 00:52:23 INFO DAGScheduler: Got job 2 (take at xx.java:55) with 2 output partitions (allowLocal=true)
14/03/07 00:52:23 INFO DAGScheduler: Final stage: Stage 2 (take at xx.java:55)
14/03/07 00:52:23 INFO DAGScheduler: Parents of final stage: List()
14/03/07 00:52:23 INFO DAGScheduler: Missing parents: List()
14/03/07 00:52:23 INFO DAGScheduler: Submitting Stage 2 (MappedRDD[2] at textFile at yy.java:215), which has no missing parents
14/03/07 00:52:23 INFO DAGScheduler: Submitting 2 missing tasks from Stage 2 (MappedRDD[2] at textFile at yy.java:215)
14/03/07 00:52:23 INFO TaskSchedulerImpl: Adding task set 2.0 with 2 tasks
14/03/07 00:52:23 INFO TaskSetManager: Starting task 2.0:0 as TID 150 on executor 1: ip-172-31-10-192.us-west-1.compute.internal (NODE_LOCAL)
14/03/07 00:52:23 INFO TaskSetManager: Serialized task 2.0:0 as 16073 bytes in 0 ms
14/03/07 00:52:23 INFO TaskSetManager: Starting task 2.0:1 as TID 151 on executor 4: ip-172-31-10-193.us-west-1.compute.internal (NODE_LOCAL)
14/03/07 00:52:23 INFO TaskSetManager: Serialized task 2.0:1 as 16073 bytes in 0 ms

In the line

14/03/07 00:52:21 INFO HadoopRDD: Input split: hdfs://ec2:9000/event_0000.csv:0+134217728

134217728 is the block size HDFS currently uses.

First, I don't understand this line:

Got job 2 (take at xx.java:55) with 2 output partitions (allowLocal=true)

since rdd.take() was already submitted as job 1. Because numItems is more than what one block holds, I am guessing it is trying to fetch the second block, but the program hangs at this point. When I increase the HDFS block size so that it is larger than the input file event_0000.csv, the program works again.

Could anybody help me understand the cause?

Thanks a lot,
-chen
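P.S. For reference, here is a minimal sketch of what my code does. The HDFS path matches the log above, but the class name and the numItems value are stand-ins; the real job has more logic around the take() call:

import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class TakeHangRepro {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("take-hang-repro");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // One partition per 128 MB HDFS block of the input file.
        JavaRDD<String> lines = sc.textFile("hdfs://ec2:9000/event_0000.csv");

        // take() first computes partition 0 in a local job (job 1 in the log);
        // when that partition yields fewer than numItems rows, it submits a
        // second job over more partitions (job 2), which is where it hangs.
        int numItems = 5000000; // stand-in; more rows than one block holds
        List<String> rows = lines.take(numItems);
        System.out.println("Fetched " + rows.size() + " rows");

        sc.stop();
    }
}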