Hi Pradeep,

That is a good idea. My customized RDDs are similar to the NewHadoopRDD. If
we have billions of InputSplit, will it be bottlenecked for the
performance? That is, will too many data need to be transferred from master
node to computing nodes by networking?

Thanks,
Fei

On Mon, Jan 16, 2017 at 2:07 PM, Pradeep Gollakota <pradeep...@gmail.com>
wrote:

> Usually this kind of thing can be done at a lower level in the InputFormat
> usually by specifying the max split size. Have you looked into that
> possibility with your InputFormat?
>
> On Sun, Jan 15, 2017 at 9:42 PM, Fei Hu <hufe...@gmail.com> wrote:
>
>> Hi Jasbir,
>>
>> Yes, you are right. Do you have any idea about my question?
>>
>> Thanks,
>> Fei
>>
>> On Mon, Jan 16, 2017 at 12:37 AM, <jasbir.s...@accenture.com> wrote:
>>
>>> Hi,
>>>
>>>
>>>
>>> Coalesce is used to decrease the number of partitions. If you give the
>>> value of numPartitions greater than the current partition, I don’t think
>>> RDD number of partitions will be increased.
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Jasbir
>>>
>>>
>>>
>>> *From:* Fei Hu [mailto:hufe...@gmail.com]
>>> *Sent:* Sunday, January 15, 2017 10:10 PM
>>> *To:* zouz...@cs.toronto.edu
>>> *Cc:* user @spark <u...@spark.apache.org>; dev@spark.apache.org
>>> *Subject:* Re: Equally split a RDD partition into two partition at the
>>> same node
>>>
>>>
>>>
>>> Hi Anastasios,
>>>
>>>
>>>
>>> Thanks for your reply. If I just increase the numPartitions to be twice
>>> larger, how coalesce(numPartitions: Int, shuffle: Boolean = false)
>>> keeps the data locality? Do I need to define my own Partitioner?
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Fei
>>>
>>>
>>>
>>> On Sun, Jan 15, 2017 at 3:58 AM, Anastasios Zouzias <zouz...@gmail.com>
>>> wrote:
>>>
>>> Hi Fei,
>>>
>>>
>>>
>>> How you tried coalesce(numPartitions: Int, shuffle: Boolean = false) ?
>>>
>>>
>>>
>>> https://github.com/apache/spark/blob/branch-1.6/core/src/mai
>>> n/scala/org/apache/spark/rdd/RDD.scala#L395
>>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_spark_blob_branch-2D1.6_core_src_main_scala_org_apache_spark_rdd_RDD.scala-23L395&d=DgMFaQ&c=eIGjsITfXP_y-DLLX0uEHXJvU8nOHrUK8IrwNKOtkVU&r=7scIIjM0jY9x3fjvY6a_yERLxMA2NwA8l0DnuyrL6yA&m=bFMBTBwSwMOFRd7Or6fF0sQOH87UIhmuUqEO9UkxPIY&s=qNa3MyvKhIDlXHtxm3s0DZJRZaSWIHpaNhcS86GEQow&e=>
>>>
>>>
>>>
>>> coalesce is mostly used for reducing the number of partitions before
>>> writing to HDFS, but it might still be a narrow dependency (satisfying your
>>> requirements) if you increase the # of partitions.
>>>
>>>
>>>
>>> Best,
>>>
>>> Anastasios
>>>
>>>
>>>
>>> On Sun, Jan 15, 2017 at 12:58 AM, Fei Hu <hufe...@gmail.com> wrote:
>>>
>>> Dear all,
>>>
>>>
>>>
>>> I want to equally divide a RDD partition into two partitions. That
>>> means, the first half of elements in the partition will create a new
>>> partition, and the second half of elements in the partition will generate
>>> another new partition. But the two new partitions are required to be at the
>>> same node with their parent partition, which can help get high data
>>> locality.
>>>
>>>
>>>
>>> Is there anyone who knows how to implement it or any hints for it?
>>>
>>>
>>>
>>> Thanks in advance,
>>>
>>> Fei
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> --
>>>
>>> -- Anastasios Zouzias
>>>
>>>
>>>
>>> ------------------------------
>>>
>>> This message is for the designated recipient only and may contain
>>> privileged, proprietary, or otherwise confidential information. If you have
>>> received it in error, please notify the sender immediately and delete the
>>> original. Any other use of the e-mail by you is prohibited. Where allowed
>>> by local law, electronic communications with Accenture and its affiliates,
>>> including e-mail and instant messaging (including content), may be scanned
>>> by our systems for the purposes of information security and assessment of
>>> internal compliance with Accenture policy.
>>> ____________________________________________________________
>>> __________________________
>>>
>>> www.accenture.com
>>>
>>
>>
>

Reply via email to