Thanks Marcelo! The reason I was asking that question is that I was expecting my Spark job to be a "map only" job. In other words, it should finish once mapPartitions has run over all partitions, since the job is just mapPartitions() followed by count(), and mapPartitions yields only one integer per partition (roughly the pattern sketched below). The first stage, "count at /root/workspace/**/mapred/aerospike_calculations.py:35", completed after a reasonably long time, and I expected the whole job to finish right after that stage. To my surprise, the second stage, "collect at NativeMethodAccessorImpl.java:-2", runs extremely slowly, about as slow as the first stage.
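For reference, this is a minimal sketch of the kind of job I mean (names and the parallelize() input are placeholders, not the actual aerospike_calculations.py code), assuming a plain PySpark RDD pipeline:

    from pyspark import SparkContext

    sc = SparkContext(appName="map-only-sketch")

    def count_partition(records):
        # Consume one partition and yield exactly one integer for it.
        n = 0
        for _ in records:
            n += 1
        yield n

    # Stand-in for the real input RDD; the real job reads from Aerospike.
    rdd = sc.parallelize(range(1000000), numSlices=8)

    per_partition = rdd.mapPartitions(count_partition)
    print(per_partition.count())  # should equal the number of partitions

    sc.stop()

My expectation was that such a job needs no shuffle and should end as soon as the per-partition counts are produced.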
I want to know what the second stage is doing.

================================ UI ============================
Spark Stages
  Total Duration:   8.2 h
  Scheduling Mode:  FIFO
  Active Stages:    1
  Completed Stages: 2
  Failed Stages:    0

Active Stages (1)
  Stage Id  Description                                   Submitted            Duration  Tasks: Succeeded/Total  Input      Shuffle Read  Shuffle Write
  2         collect at NativeMethodAccessorImpl.java:-2   2015/08/13 16:01:59  4.1 h     360/2048                375.1 GB

Completed Stages (2)
  Stage Id  Description                                                Submitted            Duration  Tasks: Succeeded/Total  Input      Shuffle Read  Shuffle Write
  1         count at /root/workspace/**/aerospike_calculations.py:35   2015/08/13 12:02:40  7.5 h     2048/2048               1785.6 GB
  0         first at SerDeUtil.scala:70                                2015/08/13 12:02:34  4 s       1/1                     839.0 MB

Failed Stages (0)
  (none)