repartition() is actually just an alias of coalesce(), but which the
shuffle flag to set to true. This shuffle is probably what you're seeing as
taking longer, but it is required when you go from a smaller number of
partitions to a larger.

When actually decreasing the number of partitions, coalesce(shuffle =
false) will be fully pipelined, but is limited in how it can redistribute
data, as it can only combine whole partitions into larger partitions. For
example, if you have an rdd with 101 partitions, and you do
rdd.coalesce(100, shuffle = false), then the resultant rdd will have 99 of
the original partitions, and 1 partition will just be 2 original partitions
combined. This can lead to increased data skew, but requires no effort to
create.

On the other hand, if you do rdd.coalesce(100, shuffle = true), then all of
the data will actually be reshuffled into 100 new evenly-sized partitions,
eliminating any data skew at the cost of actually moving all data around.


On Tue, Jun 17, 2014 at 4:52 PM, abhiguruvayya <sharath.abhis...@gmail.com>
wrote:

> I found the main reason to be that i was using coalesce instead of
> repartition. coalesce was shrinking the portioning so the number of tasks
> were very less to be executed by all of the executors. Can you help me in
> understudying when to use coalesce and when to use repartition. In
> application coalesce is being processed faster then repartition. Which is
> unusual.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Executors-not-utilized-properly-tp7744p7787.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>

Reply via email to