I don't know exactly what's going on under the hood, but I would not assume
that just because a whole partition is not being pulled into memory at one
time, each record is being pulled one at a time. That's the beauty of
exposing Iterators and Iterables in an API rather than collections:
a lot of buffering can be hidden from the user to make the iteration as
efficient as it can be.
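A minimal plain-Java sketch of that idea (no Spark dependency; the class and method names here are hypothetical, chosen for illustration): an Iterator can map records lazily, pulling one element from the source only when the consumer asks for it, while any buffering stays hidden behind the interface.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch, not Spark code: map each String to its length
// lazily. Nothing is materialized up front; each call to next() pulls
// exactly one record from the underlying source.
public class LazyLengths {
    public static Iterator<Integer> lengths(final Iterator<String> input) {
        return new Iterator<Integer>() {
            public boolean hasNext() { return input.hasNext(); }
            public Integer next()    { return input.next().length(); }
            public void remove()     { throw new UnsupportedOperationException(); }
        };
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("a", "bb", "ccc");
        Iterator<Integer> out = lengths(records.iterator());
        while (out.hasNext()) {
            System.out.println(out.next()); // prints 1, then 2, then 3
        }
    }
}
```

The consumer sees a plain Iterator either way, so whether the source buffers a block of records or fetches them one by one is an implementation detail of the source.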

On Thu, Jun 25, 2015 at 11:36 AM, Shushant Arora <shushantaror...@gmail.com>
wrote:

> Yes, one partition per core, and mapPartitions applies the function to
> each partition.
>
> The question is: does the complete partition load into memory so that the
> function can be applied to it, or is it an iterator where iterator.next()
> loads the next record? And if the latter, how is it more efficient than
> map, which also works on one record at a time?
>
>
> Is the only difference that the while loop below runs once per record, as
> in map, while the code above the loop runs only once per partition?
>
>
> public Iterable<Integer> call(Iterator<String> input)
>         throws Exception {
>     List<Integer> output = new ArrayList<Integer>();
>     while (input.hasNext()) {
>         output.add(input.next().length());
>     }
>     return output;
> }
>
>
> So if I don't have any heavy code above the while loop, will the
> performance be the same as with the map function?
>
>
>
> On Thu, Jun 25, 2015 at 6:51 PM, Hao Ren <inv...@gmail.com> wrote:
>
>> It's not the number of executors that matters, but the number of CPU
>> cores in your cluster.
>>
>> Each partition will be loaded on a core for computing.
>>
>> e.g. a cluster of 3 nodes has 24 cores, and you divide the RDD into 24
>> partitions (24 tasks for a narrow dependency).
>> Then all 24 partitions will be loaded onto your cluster in parallel, one
>> on each core.
>> You may notice that some tasks finish more quickly than others, so
>> divide the RDD into (2~3) x (# of cores) partitions for better pipelining.
>> Say you have 72 partitions in your RDD: initially 24 tasks run on 24
>> cores, then it's first done, first served until all 72 tasks are processed.
>>
>> Back to your original question: map and mapPartitions are both
>> transformations, but at different granularity.
>> map => applies the function to each record in each partition.
>> mapPartitions => applies the function to each partition.
>> But the rule is the same: one partition per core.
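The practical payoff of that granularity difference is per-partition setup cost. A plain-Java sketch (no Spark dependency; the class names and the "expensive resource" are hypothetical stand-ins for something like a database connection): the function analogous to what you pass to mapPartitions is called once per partition, so the setup runs once per partition rather than once per record.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Sketch of why mapPartitions can beat map: expensive setup runs once
// per partition instead of once per record. "ExpensiveResource" is a
// hypothetical stand-in for e.g. a DB connection or a parser.
public class PerPartitionSetup {
    public static int setupCount = 0;

    static class ExpensiveResource {
        ExpensiveResource() { setupCount++; }          // count constructions
        int process(String s) { return s.length(); }   // trivial "work"
    }

    // Analogue of the function passed to mapPartitions: invoked once per
    // partition, then it iterates that partition's records.
    public static List<Integer> callPerPartition(Iterator<String> partition) {
        ExpensiveResource r = new ExpensiveResource(); // once per partition
        List<Integer> out = new ArrayList<Integer>();
        while (partition.hasNext()) {
            out.add(r.process(partition.next()));
        }
        return out;
    }

    public static void main(String[] args) {
        // Two "partitions" of records.
        List<List<String>> partitions = Arrays.asList(
            Arrays.asList("a", "bb", "ccc"),
            Arrays.asList("dddd", "ee"));
        for (List<String> p : partitions) {
            System.out.println(callPerPartition(p.iterator()));
        }
        System.out.println("setups: " + setupCount); // 2, not 5
    }
}
```

With map, the equivalent setup would have to happen inside the per-record function (or be shared some other way), which is exactly the "heavy code above the while loop" case from the question.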
>>
>> Hope it helps.
>> Hao
>>
>>
>>
>>
>> On Thu, Jun 25, 2015 at 1:28 PM, Shushant Arora <
>> shushantaror...@gmail.com> wrote:
>>
>>> Say the source is HDFS, and the file is divided into 10 partitions.
>>> What will input contain in
>>>
>>> public Iterable<Integer> call(Iterator<String> input)
>>>
>>> Say I have 10 executors in the job, each holding a single partition.
>>>
>>> Will input hold part of the partition or the complete one? And if only
>>> part, when I call input.next(), will it fetch the rest? How is that
>>> handled?
>>>
>>>
>>>
>>>
>>>
>>> On Thu, Jun 25, 2015 at 3:11 PM, Sean Owen <so...@cloudera.com> wrote:
>>>
>>>> No, or at least, it depends on how the source of the partitions was
>>>> implemented.
>>>>
>>>> On Thu, Jun 25, 2015 at 12:16 PM, Shushant Arora
>>>> <shushantaror...@gmail.com> wrote:
>>>> > Does mapPartitions keep the complete partition in executor memory as
>>>> > an Iterable?
>>>> >
>>>> > JavaRDD<String> rdd = jsc.textFile("path");
>>>> > JavaRDD<Integer> output = rdd.mapPartitions(new
>>>> > FlatMapFunction<Iterator<String>, Integer>() {
>>>> >
>>>> > public Iterable<Integer> call(Iterator<String> input)
>>>> > throws Exception {
>>>> > List<Integer> output = new ArrayList<Integer>();
>>>> > while(input.hasNext()){
>>>> > output.add(input.next().length());
>>>> > }
>>>> > return output;
>>>> > }
>>>> >
>>>> > });
>>>> >
>>>> >
>>>> > Here, is input present in memory, and can it contain a complete
>>>> > partition of GBs?
>>>> > Will this function call(Iterator<String> input) be called only as
>>>> > many times as there are partitions (say 10 in this example), not as
>>>> > many times as there are lines (say 10000000)?
>>>> >
>>>> >
>>>> > And what's the use of mapPartitionsWithIndex?
>>>> >
>>>> > Thanks
>>>> >
>>>>
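On the mapPartitionsWithIndex question above, a plain-Java sketch of its semantics (no Spark dependency; all names here are hypothetical): the user function receives the partition's index plus an iterator over its records, and is invoked exactly once per partition, not once per record.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Sketch of mapPartitionsWithIndex semantics: the function gets
// (partitionIndex, recordIterator) and runs once per partition.
public class WithIndexSketch {
    public interface IndexedFunc {
        List<String> call(int partitionIndex, Iterator<String> records);
    }

    public static List<String> runOverPartitions(List<List<String>> partitions,
                                                 IndexedFunc f) {
        List<String> all = new ArrayList<String>();
        for (int i = 0; i < partitions.size(); i++) {
            // One invocation per partition, with that partition's index.
            all.addAll(f.call(i, partitions.get(i).iterator()));
        }
        return all;
    }

    public static void main(String[] args) {
        List<List<String>> partitions = Arrays.asList(
            Arrays.asList("a", "b"), Arrays.asList("c"));
        // Tag each record with the partition it came from -- a typical
        // use of the index (e.g. inspecting data skew per partition).
        List<String> tagged = runOverPartitions(partitions, new IndexedFunc() {
            public List<String> call(int idx, Iterator<String> records) {
                List<String> out = new ArrayList<String>();
                while (records.hasNext()) out.add(idx + ":" + records.next());
                return out;
            }
        });
        System.out.println(tagged); // [0:a, 0:b, 1:c]
    }
}
```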
>>>
>>>
>>
>>
>> --
>> Hao Ren
>>
>> Data Engineer @ leboncoin
>>
>> Paris, France
>>
>
>
