Hi Fei,

I looked at the code of CoalescedRDD, and what I suggested earlier probably
will not work.

Speaking of which, CoalescedRDD is private[spark]. If this were not the
case, you could set balanceSlack to 1 and get what you asked for; see

https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/rdd/CoalescedRDD.scala#L75
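
Just to illustrate the knob (this only compiles from code living inside the
org.apache.spark package, since the class is private[spark], and I am going
by the 1.6 constructor, so treat it as a sketch rather than something you
can call from user code):

  import scala.reflect.ClassTag
  import org.apache.spark.rdd.{CoalescedRDD, RDD}

  // Hypothetical helper; balanceSlack = 1.0 tells the coalescer to favour
  // locality over balance when grouping parent partitions.
  def localityCoalesce[T: ClassTag](prev: RDD[T], maxPartitions: Int): RDD[T] =
    new CoalescedRDD(prev, maxPartitions, balanceSlack = 1.0)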

Maybe you could use the CoalescedRDD code as a starting point for
implementing your requirement.
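
Alternatively, a small custom RDD might already be enough. Here is an
untested sketch of mine (SplitRDD and HalfPartition are my own names, not
Spark classes, and it assumes each parent partition fits in memory so the
midpoint can be found):

  import scala.reflect.ClassTag

  import org.apache.spark.{NarrowDependency, Partition, TaskContext}
  import org.apache.spark.rdd.RDD

  // Child partition: half = 0 takes the first half of the parent's
  // elements, half = 1 the second half.
  class HalfPartition(override val index: Int, val parentIndex: Int, val half: Int)
    extends Partition

  // Splits every parent partition into two child partitions that keep the
  // parent's preferred locations (i.e. stay on the same node).
  class SplitRDD[T: ClassTag](parent: RDD[T])
    extends RDD[T](parent.sparkContext, Seq(new NarrowDependency[T](parent) {
      // Child partition i depends only on parent partition i / 2, so the
      // dependency stays narrow (no shuffle).
      override def getParents(partitionId: Int): Seq[Int] = Seq(partitionId / 2)
    })) {

    override def getPartitions: Array[Partition] =
      parent.partitions.flatMap { p =>
        Seq[Partition](new HalfPartition(2 * p.index, p.index, 0),
                       new HalfPartition(2 * p.index + 1, p.index, 1))
      }

    // Children inherit the parent's preferred locations, which is what
    // gives you the data locality you asked about.
    override def getPreferredLocations(split: Partition): Seq[String] = {
      val hp = split.asInstanceOf[HalfPartition]
      parent.preferredLocations(parent.partitions(hp.parentIndex))
    }

    override def compute(split: Partition, context: TaskContext): Iterator[T] = {
      val hp = split.asInstanceOf[HalfPartition]
      // Materializes the whole parent partition to find its midpoint;
      // assumes the partition fits in memory.
      val elems = parent.iterator(parent.partitions(hp.parentIndex), context).toArray
      val mid = elems.length / 2
      if (hp.half == 0) elems.take(mid).iterator else elems.drop(mid).iterator
    }
  }

You would then call new SplitRDD(rdd) directly; no custom Partitioner is
needed, since this works at the partition level rather than on keys.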

Good luck!
Cheers,
Anastasios


On Sun, Jan 15, 2017 at 5:39 PM, Fei Hu <hufe...@gmail.com> wrote:

> Hi Anastasios,
>
> Thanks for your reply. If I just increase numPartitions to twice the
> current number, how does coalesce(numPartitions: Int, shuffle: Boolean =
> false) keep the data locality? Do I need to define my own Partitioner?
>
> Thanks,
> Fei
>
> On Sun, Jan 15, 2017 at 3:58 AM, Anastasios Zouzias <zouz...@gmail.com>
> wrote:
>
>> Hi Fei,
>>
>> Have you tried coalesce(numPartitions: Int, shuffle: Boolean = false)?
>>
>> https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L395
>>
>> coalesce is mostly used to reduce the number of partitions before
>> writing to HDFS, but it might still be a narrow dependency (satisfying
>> your requirements) if you increase the # of partitions.
>>
>> Best,
>> Anastasios
>>
>> On Sun, Jan 15, 2017 at 12:58 AM, Fei Hu <hufe...@gmail.com> wrote:
>>
>>> Dear all,
>>>
>>> I want to divide an RDD partition equally into two partitions. That is,
>>> the first half of the elements in the partition will form one new
>>> partition, and the second half will form another new partition. But the
>>> two new partitions are required to be on the same node as their parent
>>> partition, which helps achieve high data locality.
>>>
>>> Does anyone know how to implement this, or have any hints for it?
>>>
>>> Thanks in advance,
>>> Fei
>>>
>>>
>>
>>
>> --
>> -- Anastasios Zouzias
>> <a...@zurich.ibm.com>
>>
>
>


-- 
-- Anastasios Zouzias
<a...@zurich.ibm.com>
