[
https://issues.apache.org/jira/browse/SPARK-15690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Michael Armbrust updated SPARK-15690:
-------------------------------------
Target Version/s: 2.3.0 (was: 2.2.0)
> Fast single-node (single-process) in-memory shuffle
> ---------------------------------------------------
>
> Key: SPARK-15690
> URL: https://issues.apache.org/jira/browse/SPARK-15690
> Project: Spark
> Issue Type: New Feature
> Components: Shuffle, SQL
> Reporter: Reynold Xin
>
> Spark's current shuffle implementation sorts all intermediate data by their
> partition id, and then write the data to disk. This is not a big bottleneck
> because the network throughput on commodity clusters tend to be low. However,
> an increasing number of Spark users are using the system to process data on a
> single-node. When in a single node operating against intermediate data that
> fits in memory, the existing shuffle code path can become a big bottleneck.
> The goal of this ticket is to change Spark so it can use in-memory radix sort
> to do data shuffling on a single node, and still gracefully fallback to disk
> if the data size does not fit in memory. Given the number of partitions is
> usually small (say less than 256), it'd require only a single pass do to the
> radix sort with pretty decent CPU efficiency.
> Note that there have been many in-memory shuffle attempts in the past. This
> ticket has a smaller scope (single-process), and aims to actually
> productionize this code.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]