Hi Folks,

Does anybody know why preservesPartitioning is not allowed in RDD.map? Am I missing something here?
The following example involves two shuffles. If preservesPartitioning were allowed in map, I think we could avoid the second one, right? The map producing r4 does not change the keys, so r3's partitioner would still be valid for the second reduceByKey.

    val r1 = sc.parallelize(List(1,2,3,4,5,5,6,6,7,8,9,10,2,4))
    val r2 = r1.map((_, 1))
    val r3 = r2.reduceByKey(_ + _)
    val r4 = r3.map(x => (x._1, x._2 + 1))
    val r5 = r4.reduceByKey(_ + _)
    r5.collect.foreach(println)

    scala> r5.toDebugString
    res2: String =
    (8) ShuffledRDD[4] at reduceByKey at <console>:29 []
     +-(8) MapPartitionsRDD[3] at map at <console>:27 []
        |  ShuffledRDD[2] at reduceByKey at <console>:25 []
        +-(8) MapPartitionsRDD[1] at map at <console>:23 []
           |  ParallelCollectionRDD[0] at parallelize at <console>:21 []

Thanks.

Zhan Zhang
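P.S. In case it helps frame the question: mapPartitions already exposes a preservesPartitioning flag, so I believe the second shuffle can be avoided today by rewriting r4 along these lines. This is a minimal sketch, assuming the keys are left untouched so that r3's partitioner remains valid; the names r4b and r5b are just for illustration.

    // Same per-element transformation as r4, but written with
    // mapPartitions and preservesPartitioning = true, so the
    // partitioner established by r3's reduceByKey is carried along.
    val r4b = r3.mapPartitions(
      iter => iter.map { case (k, v) => (k, v + 1) },
      preservesPartitioning = true)

    // Because r4b reports the same partitioner as r3, this
    // reduceByKey should not need another shuffle.
    val r5b = r4b.reduceByKey(_ + _)
    r5b.collect.foreach(println)

With that rewrite, r5b.toDebugString should show only the one shuffle from the first reduceByKey, which is exactly what makes me wonder why plain map cannot take the same flag.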