[
https://issues.apache.org/jira/browse/SPARK-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
uncleGen updated SPARK-3376:
----------------------------
Description:
I think a memory-based shuffle can reduce some overhead of disk I/O. I just
want to know is there any plan to do something about it. Or any suggestion
about it. Base on the work (SPARK-2044), it is feasible to have several
implementations of shuffle.
Following is my testing on "InMemory Shuffle"
| data size | partitions | resources |
| 5131859218 | 2000 | 50 executors/ 4 cores/ 4GB |
| settings | operation1 |
operation2 |
| shuffle spill & lz4 | repartition+flatMap+groupByKey | repartition +
groupByKey |
|memory | 38s | 16s |
|sort | 45s | 28s |
|hash | 46s | 28s |
|no shuffle spill & lz4 | | |
| memory | 16s | 16s |
| | | |
|shuffle spill & lzf | | |
|memory| 28s | 27s |
|sort | 29s | 29s |
|hash | 41s | 30s |
|no shuffle spill & lzf | | |
| memory | 15s | 16s |
was:
I think a memory-based shuffle can reduce some overhead of disk I/O. I just
want to know is there any plan to do something about it. Or any suggestion
about it. Base on the work (SPARK-2044), it is feasible to have several
implementations of shuffle.
| data size | partitions | resources |
| 5131859218 | 2000 | 50 executors/ 4 cores/ 4GB |
| settings | operation1 |
operation2 |
| shuffle spill & lz4 | rePartition+flatMap+groupByKey | rePartition +
groupByKey |
|memory | 38s | 16s |
|sort | 45s | 28s |
|hash | 46s | 28s |
|no shuffle spill & lz4 | | |
| memory | 16s | 16s |
| | | |
|shuffle spill & lzf | | |
|memory| 28s | 27s |
|sort | 29s | 29s |
|hash | 41s | 30s |
|no shuffle spill & lzf | | |
| memory | 15s | 16s |
> Memory-based shuffle strategy to reduce overhead of disk I/O
> ------------------------------------------------------------
>
> Key: SPARK-3376
> URL: https://issues.apache.org/jira/browse/SPARK-3376
> Project: Spark
> Issue Type: Planned Work
> Reporter: uncleGen
> Priority: Trivial
>
> I think a memory-based shuffle can reduce some overhead of disk I/O. I just
> want to know is there any plan to do something about it. Or any suggestion
> about it. Base on the work (SPARK-2044), it is feasible to have several
> implementations of shuffle.
> Following is my testing on "InMemory Shuffle"
> | data size | partitions | resources |
> | 5131859218 | 2000 | 50 executors/ 4 cores/ 4GB |
> | settings | operation1 |
> operation2 |
> | shuffle spill & lz4 | repartition+flatMap+groupByKey | repartition +
> groupByKey |
> |memory | 38s | 16s |
> |sort | 45s | 28s |
> |hash | 46s | 28s |
> |no shuffle spill & lz4 | | |
> | memory | 16s | 16s |
> | | | |
> |shuffle spill & lzf | | |
> |memory| 28s | 27s |
> |sort | 29s | 29s |
> |hash | 41s | 30s |
> |no shuffle spill & lzf | | |
> | memory | 15s | 16s |
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]