Thanks, Sean, for the very useful comments. I now understand better what could be causing my measurements to be off.

best,
/Shahab

On Mon, Nov 3, 2014 at 12:08 PM, Sean Owen <so...@cloudera.com> wrote:
> Yes, partitions matter. Usually you can use the default, which will
> make a partition per input split, and that's usually good, to let one
> task process one block of data, which will all be on one machine.
>
> Reasons I could imagine why 9 partitions is faster than 7:
>
> Probably: Your cluster can execute at least 9 tasks concurrently. It
> will finish faster since each partition is smaller when split into 9
> partitions. This just means you weren't using your cluster's full
> parallelism at 7.
>
> 9 partitions lets tasks execute entirely locally to the data, whereas
> 7 is too few compared to how the data blocks are distributed on HDFS.
> That is, maybe 7 is inducing a shuffle whereas 9 is not for some
> reason in your code.
>
> Your executors are running near their memory limit and are thrashing
> in GC. With less data to process each, you may avoid thrashing and so
> go a lot faster.
>
> (Or there's some other factor that messed up your measurements :))
>
> There can be instances where more partitions is slower too.
>
> On Mon, Nov 3, 2014 at 9:57 AM, shahab <shahab.mok...@gmail.com> wrote:
> > Hi,
> >
> > I just wonder how the number of partitions affects performance in Spark!
> >
> > Is it just the parallelism (more partitions, more parallel sub-tasks)
> > that improves the performance? Or are there other considerations?
> >
> > In my case, I ran a couple of map/reduce jobs on the same dataset twice,
> > with two different partition counts, 7 and 9. I used a standalone
> > cluster, with two workers on each, where the master resides on the same
> > machine as one of the workers.
> >
> > Surprisingly, the performance of the map/reduce jobs with 9 partitions
> > is almost 4X-5X better than with 7 partitions!? Does that mean that
> > choosing the right number of partitions is the key factor in Spark
> > performance?
> >
> > best,
> > /Shahab
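
To make the first explanation above (unused parallelism) concrete, here is a rough toy model in Python. It is only a sketch, not how Spark schedules work: it assumes the work splits evenly across partitions, that tasks run in "waves" limited by the number of concurrent task slots, and it ignores shuffles, data locality, GC and scheduling overhead. The function name `estimated_stage_time` and the numbers (63 units of work, 8 or 9 slots) are made up for illustration.

```python
import math


def estimated_stage_time(total_work, num_partitions, concurrent_slots):
    """Toy estimate of stage time (hypothetical model, not a Spark API).

    Assumes work is split evenly, one task per partition, and that at
    most `concurrent_slots` tasks run at the same time.
    """
    # Number of rounds needed to run all tasks, given the slot limit.
    waves = math.ceil(num_partitions / concurrent_slots)
    # Each task processes an equal share of the total work.
    per_task = total_work / num_partitions
    return waves * per_task


# With 9 free slots, 9 partitions means smaller tasks in a single wave.
seven_on_nine_slots = estimated_stage_time(63, 7, 9)
nine_on_nine_slots = estimated_stage_time(63, 9, 9)

# With only 8 slots, 9 partitions needs a second wave for one leftover task.
nine_on_eight_slots = estimated_stage_time(63, 9, 8)

print(seven_on_nine_slots, nine_on_nine_slots, nine_on_eight_slots)
```

Under these assumptions, going from 7 to 9 partitions only helps when the cluster really has at least 9 free slots, and with 8 slots the extra wave makes 9 partitions slower than 7, which is one possible way more partitions can end up slower. The gain in this model is also modest, so a 4X-5X difference would probably involve one of the other factors mentioned above.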