GitHub user JoshRosen commented on the pull request:
https://github.com/apache/spark/pull/5868#issuecomment-101837095
Here are some results from a microbenchmark that I ran locally on my laptop.
It measures the time for a single map task to write its shuffle output. The
records are (Int, Int) key-value pairs, and I used 1000 shuffle output
partitions. The source for the benchmark can be found at
https://gist.github.com/JoshRosen/15ab99014fb21cbd1e65
In this benchmark, `sort_serialize` is the SortShuffleManager with the
optimizations from https://github.com/apache/spark/pull/4450 enabled. I also
tried benchmarking against Spark 1.2's sort-based shuffle, but gave up because
it quickly ran into GC thrashing (even with spilling enabled).
Each result is the mean of 5 runs, preceded by 3 warmup runs.
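The warmup/measurement protocol can be sketched roughly as follows (a minimal illustration only; `run_shuffle_write` is a hypothetical stand-in for the benchmarked map-task write, and the real harness is in the gist linked above):

```python
import time

def benchmark(body, warmups=3, runs=5):
    """Run `warmups` untimed iterations, then time `runs` measured ones (ms)."""
    for _ in range(warmups):
        body()
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        body()
        timings.append((time.perf_counter() - start) * 1000.0)
    return timings

def mean(xs):
    return sum(xs) / len(xs)

if __name__ == "__main__":
    # Hypothetical stand-in workload for the shuffle write being measured.
    def run_shuffle_write():
        _ = [(i, i) for i in range(100_000)]

    timings = benchmark(run_shuffle_write)
    print(f"mean: {mean(timings):.1f} ms over {len(timings)} runs")
```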
<table border="1" class="dataframe">
<thead>
<tr>
<th rowspan="2">mode</th>
<th colspan="4">mean time (ms), by numKeys</th>
<th colspan="4">std dev (ms), by numKeys</th>
</tr>
<tr>
<th>10,000,000</th>
<th>20,000,000</th>
<th>30,000,000</th>
<th>40,000,000</th>
<th>10,000,000</th>
<th>20,000,000</th>
<th>30,000,000</th>
<th>40,000,000</th>
</tr>
</thead>
<tbody>
<tr>
<th>sort_serialize</th>
<td> 5057.6</td>
<td> 17871.2</td>
<td> 27368.0</td>
<td> 38202</td>
<td> 174.925127</td>
<td> 270.966234</td>
<td> 2209.958371</td>
<td> 1820.343237</td>
</tr>
<tr>
<th>tungsten-sort</th>
<td> 4207.6</td>
<td> 10720.6</td>
<td> 22414.4</td>
<td> 23966</td>
<td> 643.662800</td>
<td> 896.888678</td>
<td> 5501.487917</td>
<td> 4143.371514</td>
</tr>
</tbody>
</table>
To help interpret these results, here is the corresponding table of disk
bytes spilled:
<table border="1" class="dataframe">
<thead>
<tr>
<th rowspan="2">mode</th>
<th colspan="4">mean diskBytesSpilled, by numKeys</th>
</tr>
<tr>
<th>10,000,000</th>
<th>20,000,000</th>
<th>30,000,000</th>
<th>40,000,000</th>
</tr>
</thead>
<tbody>
<tr>
<th>sort_serialize</th>
<td> 0</td>
<td> 113607985</td>
<td> 113607985</td>
<td> 229053192</td>
</tr>
<tr>
<th>tungsten-sort</th>
<td> 0</td>
<td> 113599278</td>
<td> 113599284</td>
<td> 229043276</td>
</tr>
</tbody>
</table>
As expected, both shuffle implementations perform similarly when the data
fits in memory. I think that `UnsafeShuffleManager`'s optimized shuffle merge
path explains the large performance gains for shuffles that trigger disk
spilling. Most of the run-to-run variability here probably comes from my
laptop's SSD. I plan to run more comprehensive benchmarks on EC2, where we
will be able to see the benefits of multiple disks and larger JVM heaps.
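For reference, the per-size speedups implied by the mean timings in the first table can be computed directly (a quick sketch using only the numbers reported above):

```python
# Speedup of tungsten-sort over sort_serialize, from the mean timings (ms)
# in the first table above.
num_keys = [10_000_000, 20_000_000, 30_000_000, 40_000_000]
sort_serialize_ms = [5057.6, 17871.2, 27368.0, 38202.0]
tungsten_sort_ms = [4207.6, 10720.6, 22414.4, 23966.0]

# Speedup = old time / new time, per numKeys size.
speedups = [old / new for old, new in zip(sort_serialize_ms, tungsten_sort_ms)]

for n, s in zip(num_keys, speedups):
    print(f"{n:>10,} keys: {s:.2f}x")
```

The largest gains show up at the sizes where the disk-spill table above reports spilling.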