[ https://issues.apache.org/jira/browse/SPARK-5997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16395405#comment-16395405 ]
Josiah Berkebile commented on SPARK-5997:
-----------------------------------------
This functionality can be useful when repartitioning output for a storage
system like HDFS that doesn't handle large numbers of small files well. In a
few Spark applications I've written, I've had to fan out the number of
partitions to spread the load of heavy calculations across the cluster, and
then reduce that number back down to one partition per executor so that the
number of output files to HDFS, and therefore the block count, stays small.
However, I can't think of a scenario where it would make sense to reduce the
number of partitions below the number of executors. If someone needs to do
that, I think it would be more appropriate to schedule a follow-up job that
sweeps up after Spark and concatenates its output into a single file.
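The fan-out/fan-in pattern described above can be sketched with a plain-Python
model (no Spark dependency) of how coalesce without a shuffle merges
partitions: parent partitions are grouped into contiguous buckets and
concatenated, so no individual records are redistributed between groups. The
helper name coalesce_no_shuffle is hypothetical, not a Spark API.

```python
def coalesce_no_shuffle(partitions, num_target):
    # Group parent partitions into contiguous buckets, mirroring the effect
    # of Spark's coalesce(n, shuffle=False): partitions are merged locally
    # and no record moves between buckets.
    n = len(partitions)
    buckets = [[] for _ in range(num_target)]
    for i, part in enumerate(partitions):
        # Partition i goes to bucket i * num_target // n (contiguous ranges).
        buckets[i * num_target // n].extend(part)
    return buckets

# Eight small partitions produced by a fanned-out computation...
parts = [[x] for x in range(8)]
# ...merged back down to two output partitions (e.g. one per executor).
merged = coalesce_no_shuffle(parts, 2)  # -> [[0, 1, 2, 3], [4, 5, 6, 7]]
```

Because each bucket is a concatenation of whole parent partitions, the merge
can run entirely within an executor, which is why no shuffle is needed when
decreasing the partition count this way.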
> Increase partition count without performing a shuffle
> -----------------------------------------------------
>
> Key: SPARK-5997
> URL: https://issues.apache.org/jira/browse/SPARK-5997
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Reporter: Andrew Ash
> Priority: Major
>
> When decreasing partition count with rdd.repartition() or rdd.coalesce(), the
> user can choose whether or not to perform a shuffle. However, when increasing
> the partition count there is no such choice -- a shuffle always occurs.
> This Jira is to create a {{rdd.repartition(largeNum, shuffle=false)}} call
> that performs a repartition to a higher partition count without a shuffle.
> The motivating use case is to make each individual partition small enough
> that .toLocalIterator puts significantly less memory pressure on the
> driver, since it loads only one partition at a time into the driver.
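To make that motivation concrete, here is a plain-Python model (no Spark
dependency) of why toLocalIterator benefits from smaller partitions: the
driver materializes only one partition at a time, so its peak memory is
bounded by the largest partition rather than the whole dataset. The function
name and data are illustrative.

```python
def to_local_iterator(partitions):
    # Yield records one partition at a time, mirroring RDD.toLocalIterator:
    # the driver holds at most one partition's records in memory at once.
    for part in partitions:
        for record in part:
            yield record

# Ten partitions of 1000 records each: peak driver memory is roughly one
# partition (1000 records), not the full 10000-record dataset.
parts = [list(range(i * 1000, (i + 1) * 1000)) for i in range(10)]
total = sum(1 for _ in to_local_iterator(parts))  # -> 10000
```

Raising the partition count (without a shuffle) shrinks each partition, which
directly shrinks the driver's peak working set when iterating this way.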
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)