Hi Fei,

I looked at the code of CoalescedRDD, and what I suggested probably will not work.
Speaking of which, CoalescedRDD is private[spark]. If this were not the case, you could set balanceSlack to 1 and get what you requested; see
https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/rdd/CoalescedRDD.scala#L75

Maybe you could use the CoalescedRDD code as a starting point to implement your requirement.

Good luck!

Cheers,
Anastasios

On Sun, Jan 15, 2017 at 5:39 PM, Fei Hu <hufe...@gmail.com> wrote:

> Hi Anastasios,
>
> Thanks for your reply. If I just increase numPartitions to twice the
> current number, how does coalesce(numPartitions: Int, shuffle: Boolean =
> false) keep the data locality? Do I need to define my own Partitioner?
>
> Thanks,
> Fei
>
> On Sun, Jan 15, 2017 at 3:58 AM, Anastasios Zouzias <zouz...@gmail.com>
> wrote:
>
>> Hi Fei,
>>
>> Have you tried coalesce(numPartitions: Int, shuffle: Boolean = false)?
>>
>> https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L395
>>
>> coalesce is mostly used for reducing the number of partitions before
>> writing to HDFS, but it might still be a narrow dependency (satisfying
>> your requirements) if you increase the # of partitions.
>>
>> Best,
>> Anastasios
>>
>> On Sun, Jan 15, 2017 at 12:58 AM, Fei Hu <hufe...@gmail.com> wrote:
>>
>>> Dear all,
>>>
>>> I want to divide an RDD partition equally into two partitions. That
>>> is, the first half of the elements in the partition will form one new
>>> partition, and the second half will form another. The two new partitions
>>> are required to be on the same node as their parent partition, which
>>> helps achieve high data locality.
>>>
>>> Does anyone know how to implement this, or have any hints?
>>>
>>> Thanks in advance,
>>> Fei
>>>
>>
>> --
>> -- Anastasios Zouzias
>> <a...@zurich.ibm.com>
>

--
-- Anastasios Zouzias
<a...@zurich.ibm.com>
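For the archives: the idea discussed in this thread — splitting every parent partition into two child partitions through a narrow dependency while inheriting the parent's preferred locations — could be sketched as a custom RDD roughly like the following. This is a hypothetical, untested sketch against the Spark 1.6 RDD API; the class and partition names are made up for illustration, and note that each half re-iterates its parent partition, so you would want to cache the parent RDD first.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.{Dependency, NarrowDependency, Partition, TaskContext}
import scala.reflect.ClassTag

// Hypothetical child partition: remembers which parent partition it came
// from and whether it holds the first or the second half of its elements.
private case class SplitPartition(index: Int, parentIndex: Int, firstHalf: Boolean)
  extends Partition

class SplitEachPartitionRDD[T: ClassTag](prev: RDD[T])
  extends RDD[T](prev.context, Nil) {

  // Two child partitions per parent partition.
  override def getPartitions: Array[Partition] =
    prev.partitions.flatMap { p =>
      Seq(SplitPartition(2 * p.index, p.index, firstHalf = true),
          SplitPartition(2 * p.index + 1, p.index, firstHalf = false))
    }

  // Each child reads exactly one parent partition, so no shuffle is needed.
  override def getDependencies: Seq[Dependency[_]] =
    Seq(new NarrowDependency[T](prev) {
      override def getParents(partitionId: Int): Seq[Int] = Seq(partitionId / 2)
    })

  // Inherit the parent's locality preferences so both halves are scheduled
  // on the same node as the parent partition.
  override def getPreferredLocations(split: Partition): Seq[String] = {
    val sp = split.asInstanceOf[SplitPartition]
    prev.preferredLocations(prev.partitions(sp.parentIndex))
  }

  override def compute(split: Partition, context: TaskContext): Iterator[T] = {
    val sp = split.asInstanceOf[SplitPartition]
    // Materialize the parent partition to find its midpoint; expensive for
    // large partitions, and recomputed per half unless prev is cached.
    val elems = prev.iterator(prev.partitions(sp.parentIndex), context).toSeq
    val half = (elems.size + 1) / 2
    if (sp.firstHalf) elems.take(half).iterator else elems.drop(half).iterator
  }
}
```

Usage would be along the lines of `new SplitEachPartitionRDD(rdd.cache())`, after which the child RDD has twice as many partitions, each co-located with its parent.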