Josiah Berkebile commented on SPARK-5997:

It may be worth mentioning that I've had to do this for several NLP-related 
Spark jobs because of the amount of RAM consumed in building out the parse 
tree.  Increasing the partition count in proportion to the total row count, so 
that each partition holds roughly 'X' rows, both increased parallelism across 
the cluster and reduced pressure on the RAM/heap.  In these sorts of 
scenarios, exact balance across the partitions isn't critical, so performing 
a shuffle just to maintain balance is detrimental to overall job performance.
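
For illustration, the sizing rule described above (a partition count proportional to the row count, with roughly 'X' rows per partition) could be sketched as follows. The function name and the example figure of 50,000 rows per partition are assumptions made for this sketch, not values taken from the jobs mentioned.

```python
def target_partition_count(total_rows: int, rows_per_partition: int) -> int:
    """Number of partitions needed so each holds at most rows_per_partition rows.

    Hypothetical helper for illustration; not part of Spark.
    """
    if rows_per_partition <= 0:
        raise ValueError("rows_per_partition must be positive")
    # Integer ceiling division, keeping at least one partition for empty input.
    return max(1, (total_rows + rows_per_partition - 1) // rows_per_partition)


# Possible use with PySpark (assumes an existing RDD named `rdd`):
#   n = target_partition_count(rdd.count(), 50_000)
#   rdd = rdd.repartition(n)   # note: this currently always shuffles
```

Rounding up rather than down keeps every partition at or below the chosen row budget, which is what matters when the goal is to bound per-partition memory.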

> Increase partition count without performing a shuffle
> -----------------------------------------------------
>                 Key: SPARK-5997
>                 URL: https://issues.apache.org/jira/browse/SPARK-5997
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Andrew Ash
>            Priority: Major
> When decreasing partition count with rdd.repartition() or rdd.coalesce(), the 
> user has the ability to choose whether or not to perform a shuffle.  However 
> when increasing partition count there is no option of whether to perform a 
> shuffle or not -- a shuffle always occurs.
> This Jira is to create a {{rdd.repartition(largeNum, shuffle=false)}} call 
> that performs a repartition to a higher partition count without a shuffle.
> The motivating use case is to decrease the size of an individual partition 
> enough that the .toLocalIterator has significantly reduced memory pressure on 
> the driver, as it loads a partition at a time into the driver.
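
As a conceptual illustration only, the idea of raising the partition count without a shuffle can be modelled in plain Python by splitting each existing partition into contiguous slices, so that no row ever moves between parent partitions. This is not Spark code and not the API proposed in the issue; `split_partitions_locally` and its `factor` parameter are hypothetical names for this sketch.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_partitions_locally(
    partitions: Sequence[Sequence[T]], factor: int
) -> List[List[T]]:
    """Split every partition into `factor` contiguous slices.

    Each output slice is drawn from exactly one parent partition, which
    models a narrow (shuffle-free) dependency. The resulting slices are
    only as balanced as the parents were.
    """
    if factor < 1:
        raise ValueError("factor must be at least 1")
    result: List[List[T]] = []
    for part in partitions:
        n = len(part)
        for i in range(factor):
            # Integer arithmetic spreads any remainder across the slices.
            start = i * n // factor
            end = (i + 1) * n // factor
            result.append(list(part[start:end]))
    return result


parents = [[1, 2, 3, 4], [5, 6, 7]]
children = split_partitions_locally(parents, 2)
# children == [[1, 2], [3, 4], [5], [6, 7]]
```

Because each slice is smaller than its parent, a consumer that loads one partition at a time would hold fewer rows in memory at once, which is the effect the issue is after.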
