[jira] [Updated] (SPARK-3376) Memory-based shuffle strategy to reduce overhead of disk I/O

uncleGen (JIRA) Wed, 10 Dec 2014 01:11:22 -0800

     [ 
https://issues.apache.org/jira/browse/SPARK-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


uncleGen updated SPARK-3376:
----------------------------
    Description: 
I think a memory-based shuffle can reduce some overhead of disk I/O. I just 
want to know is there any plan to do something about it. Or any suggestion 
about it. Base on the work (SPARK-2044), it is feasible to have several 
implementations of  shuffle.



Following is my testing on "InMemory Shuffle"

| data size        |  partitions  |  resources |
| 5131859218  |    2000       |   50 executors/ 4 cores/ 4GB |

| settings               |  operation1                                   | 
operation2 |
| shuffle spill & lz4 |  repartition+flatMap+groupByKey | repartition + 
groupByKey | 
|memory   |   38s                   |  16s |
|sort     |   45s                   |  28s |
|hash     |   46s                   |  28s |
|no shuffle spill & lz4 | | |
| memory |   16s                         | 16s |
| | | |
|shuffle spill & lzf | | |
|memory|  28s                           | 27s |
|sort  |  29s                           | 29s |
|hash  |  41s                           | 30s |
|no shuffle spill & lzf | | |
| memory |  15s                         | 16s |

  was:
I think a memory-based shuffle can reduce some overhead of disk I/O. I just 
want to know is there any plan to do something about it. Or any suggestion 
about it. Base on the work (SPARK-2044), it is feasible to have several 
implementations of  shuffle.

| data size        |  partitions  |  resources |
| 5131859218  |    2000       |   50 executors/ 4 cores/ 4GB |

| settings               |  operation1                                   | 
operation2 |
| shuffle spill & lz4 |  rePartition+flatMap+groupByKey | rePartition + 
groupByKey | 
|memory   |   38s                   |  16s |
|sort     |   45s                   |  28s |
|hash     |   46s                   |  28s |
|no shuffle spill & lz4 | | |
| memory |   16s                         | 16s |
| | | |
|shuffle spill & lzf | | |
|memory|  28s                           | 27s |
|sort  |  29s                           | 29s |
|hash  |  41s                           | 30s |
|no shuffle spill & lzf | | |
| memory |  15s                         | 16s |


> Memory-based shuffle strategy to reduce overhead of disk I/O
> ------------------------------------------------------------
>
>                 Key: SPARK-3376
>                 URL: https://issues.apache.org/jira/browse/SPARK-3376
>             Project: Spark
>          Issue Type: Planned Work
>            Reporter: uncleGen
>            Priority: Trivial
>
> I think a memory-based shuffle can reduce some overhead of disk I/O. I just 
> want to know is there any plan to do something about it. Or any suggestion 
> about it. Base on the work (SPARK-2044), it is feasible to have several 
> implementations of  shuffle.
> Following is my testing on "InMemory Shuffle"
> | data size        |  partitions  |  resources |
> | 5131859218  |    2000       |   50 executors/ 4 cores/ 4GB |
> | settings               |  operation1                                   | 
> operation2 |
> | shuffle spill & lz4 |  repartition+flatMap+groupByKey | repartition + 
> groupByKey | 
> |memory   |   38s                   |  16s |
> |sort     |   45s                   |  28s |
> |hash     |   46s                   |  28s |
> |no shuffle spill & lz4 | | |
> | memory |   16s                         | 16s |
> | | | |
> |shuffle spill & lzf | | |
> |memory|  28s                           | 27s |
> |sort  |  29s                           | 29s |
> |hash  |  41s                           | 30s |
> |no shuffle spill & lzf | | |
> | memory |  15s                         | 16s |



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Updated] (SPARK-3376) Memory-based shuffle strategy to reduce overhead of disk I/O

Reply via email to