Hi Folks,
Does anybody know what is the reason not allowing preserverPartitioning in
RDD.map? Do I miss something here?
Following example involves two shuffles. I think if preservePartitioning is
allowed, we can avoid the second one, right?
val r1 = sc.parallelize(List(1,2,3,4,5,5,6,6,7,8,9,10,2,4))
val r2 = r1.map((_, 1))
val r3 = r2.reduceByKey(_+_)
val r4 = r3.map(x=>(x._1, x._2 + 1))
val r5 = r4.reduceByKey(_+_)
r5.collect.foreach(println)
scala> r5.toDebugString
res2: String =
(8) ShuffledRDD[4] at reduceByKey at <console>:29 []
+-(8) MapPartitionsRDD[3] at map at <console>:27 []
| ShuffledRDD[2] at reduceByKey at <console>:25 []
+-(8) MapPartitionsRDD[1] at map at <console>:23 []
| ParallelCollectionRDD[0] at parallelize at <console>:21 []
Thanks.
Zhan Zhang
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]