Thanks Sean for very useful comments. I understand now better what could be
the reasons that my evaluations are messed up.

best,
/Shahab

On Mon, Nov 3, 2014 at 12:08 PM, Sean Owen <so...@cloudera.com> wrote:

> Yes partitions matter. Usually you can use the default, which will
> make a partition per input split, and that's usually good, to let one
> task process one block of data, which will all be on one machine.
>
> Reasons I could imagine why 9 partitions is faster than 7:
>
> Probably: Your cluster can execute at least 9 tasks concurrently. It
> will finish faster since each partition is smaller when split into 9
> partitions. This just means you weren't using your cluster's full
> parallelism at 7.
>
> 9 partitions lets tasks execute entirely locally to the data, whereas
> 7 is too few compared to how the data blocks are distributed on HDFS.
> That is, maybe 7 is inducing a shuffle whereas 9 is not for some
> reason in your code.
>
> Your executors are running near their memory limit and are thrashing
> in GC. With less data to process each, you may avoid thrashing and so
> go a lot faster.
>
> (Or there's some other factor that messed up your measurements :))
>
>
> There can be instances where more partitions is slower too.
>
> On Mon, Nov 3, 2014 at 9:57 AM, shahab <shahab.mok...@gmail.com> wrote:
> > Hi,
> >
> > I just wonder how number of partitions effect the performance in Spark!
> >
> > Is it just the parallelism (more partitions, more parallel sub-tasks)
> that
> > improves the performance? or there exist other considerations?
> >
> > In my case,I run couple of map/reduce jobs on same dataset two times with
> > two different partition numbers, 7 and 9. I used a stand alone cluster,
> with
> > two workers on each, where the master resides with the same machine as
> one
> > of the workers.
> >
> > Surprisingly, the performance of map/reduce jobs in case of 9 partitions
> is
> > almost  4X-5X better than that of 7 partitions !??  Does it mean that
> > choosing right number of partitions is the key factor in the Spark
> > performance ?
> >
> > best,
> > /Shahab
>

Reply via email to