[ https://issues.apache.org/jira/browse/SPARK-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kevin Jung updated SPARK-5081:
------------------------------
    Description: 
The shuffle write size shown in the Spark web UI is very different when I 
execute the same Spark job on the same input data in Spark 1.1 and Spark 1.2. 
At the sortBy stage, the shuffle write is 98.1MB in Spark 1.1 but 146.9MB in 
Spark 1.2. 
I set the spark.shuffle.manager option to hash because its default value 
changed in 1.2, but Spark 1.2 still writes more shuffle output than Spark 1.1.
The extra disk I/O overhead also grows faster than the input: the 1.2-to-1.1 
ratio rises from about 1.5x on the 155.6MB input to about 2.3x on a ~100GB 
input (39.7GB of shuffle write in Spark 1.1 versus 91.0GB in Spark 1.2), and 
jobs take correspondingly longer to complete.
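For reference, a minimal sketch of how the shuffle manager can be forced back 
to the pre-1.2 behavior (the app name is a placeholder, not taken from the 
original job):

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}

// Spark 1.2 changed the default of spark.shuffle.manager from "hash" to
// "sort". Setting it back to "hash" selects the shuffle path Spark 1.1 used.
val conf = new SparkConf()
  .setAppName("ShuffleWriteComparison") // placeholder name
  .set("spark.shuffle.manager", "hash")
val sc = new SparkContext(conf)
{code}

The same setting can also be passed at submit time, e.g. 
spark-submit --conf spark.shuffle.manager=hash.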

Spark 1.1
||Stage Id||Description||Input||Shuffle Read||Shuffle Write||
|9|saveAsTextFile| |1169.4KB| |
|12|combineByKey| |1265.4KB|1275.0KB|
|6|sortByKey| |1276.5KB| |
|8|mapPartitions| |91.0MB|1383.1KB|
|4|apply| |89.4MB| |
|5|sortBy|155.6MB| |98.1MB|
|3|sortBy|155.6MB| | |
|1|collect| |2.1MB| |
|2|mapValues|155.6MB| |2.2MB|
|0|first|184.4KB| | |

Spark 1.2
||Stage Id||Description||Input||Shuffle Read||Shuffle Write||
|12|saveAsTextFile| |1170.2KB| |
|11|combineByKey| |1264.5KB|1275.0KB|
|8|sortByKey| |1273.6KB| |
|7|mapPartitions| |134.5MB|1383.1KB|
|5|zipWithIndex| |132.5MB| |
|4|sortBy|155.6MB| |146.9MB|
|3|sortBy|155.6MB| | |
|2|collect| |2.0MB| |
|1|mapValues|155.6MB| |2.2MB|
|0|first|184.4KB| | |
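
The original job is not attached to this issue. The following hypothetical 
Scala sketch only mirrors the user-facing stage names in the tables above 
(first, sortBy, sortByKey, combineByKey, saveAsTextFile); the paths, the tab 
delimiter, and the count logic are assumptions, and the remaining stages 
(mapValues, collect, apply/zipWithIndex, mapPartitions) presumably come from 
the sampling that sortBy's range partitioner performs internally.

{code:scala}
// Hypothetical sketch, not the reporter's code. Reuses the SparkContext sc
// created above with spark.shuffle.manager=hash.
import org.apache.spark.SparkContext._ // pair-RDD functions on Spark 1.1/1.2

val lines = sc.textFile("hdfs:///tmp/input") // placeholder path

lines.first() // stage: first

val pairs = lines
  .sortBy(line => line) // stage: sortBy -- where the shuffle write grew
  .map(line => (line.split("\t")(0), 1L)) // assumed tab-separated key

val counts = pairs
  .sortByKey() // stage: sortByKey
  .combineByKey((v: Long) => v,
                (c: Long, v: Long) => c + v,
                (c1: Long, c2: Long) => c1 + c2) // stage: combineByKey

counts.saveAsTextFile("hdfs:///tmp/output") // placeholder output path
{code}

With a pipeline of this shape, comparing the sortBy stage's Shuffle Write 
column between 1.1 and 1.2 should reproduce the discrepancy reported above.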

  was:
The shuffle write size shown in the Spark web UI is very different when I 
execute the same Spark job on the same input data in Spark 1.1 and Spark 1.2. 
At the sortBy stage, the shuffle write is 98.1MB in Spark 1.1 but 146.9MB in 
Spark 1.2. 
I set the spark.shuffle.manager option to hash because its default value 
changed in 1.2, but Spark 1.2 still writes more shuffle output than Spark 1.1.
The extra disk I/O overhead grows as the input file gets bigger.
For example, on about 100GB of input, the shuffle write is 39.7GB in 
Spark 1.1 but 91.0GB in Spark 1.2.


> Shuffle write increases
> -----------------------
>
>                 Key: SPARK-5081
>                 URL: https://issues.apache.org/jira/browse/SPARK-5081
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle
>    Affects Versions: 1.2.0
>            Reporter: Kevin Jung
>


