Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/5868#issuecomment-101837095
  
    Here are some performance benchmark results based on a microbenchmark that 
I ran locally on my laptop. This benchmark measures the time for a single map 
task to write its shuffle output. The data type is (Int, Int) key-value pairs 
and I used 1000 shuffle output partitions.  The source for this benchmark can 
be found at https://gist.github.com/JoshRosen/15ab99014fb21cbd1e65
    
    In this benchmark, `sort_serialize` is the SortShuffleManager with 
https://github.com/apache/spark/pull/4450's optimizations enabled.  I also 
tried benchmarking against Spark 1.2's sort-based shuffle but gave up because 
it quickly ran into GC thrashing problems (even with spilling enabled).
    
    Each result is the average of 5 runs, preceded by 3 warmup runs.
    
    <table border="1" class="dataframe">
      <thead>
        <tr>
          <th></th>
          <th colspan="4" halign="left">mean</th>
          <th colspan="4" halign="left">std</th>
        </tr>
        <tr>
          <th></th>
          <th colspan="4" halign="left">timeMs</th>
          <th colspan="4" halign="left">timeMs</th>
        </tr>
        <tr>
          <th>numKeys</th>
          <th>10000000</th>
          <th>20000000</th>
          <th>30000000</th>
          <th>40000000</th>
          <th>10000000</th>
          <th>20000000</th>
          <th>30000000</th>
          <th>40000000</th>
        </tr>
        <tr>
          <th>mode</th>
          <th></th>
          <th></th>
          <th></th>
          <th></th>
          <th></th>
          <th></th>
          <th></th>
          <th></th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th>sort_serialize</th>
          <td> 5057.6</td>
          <td> 17871.2</td>
          <td> 27368.0</td>
          <td> 38202</td>
          <td> 174.925127</td>
          <td> 270.966234</td>
          <td> 2209.958371</td>
          <td> 1820.343237</td>
        </tr>
        <tr>
          <th>tungsten-sort</th>
          <td> 4207.6</td>
          <td> 10720.6</td>
          <td> 22414.4</td>
          <td> 23966</td>
          <td> 643.662800</td>
          <td> 896.888678</td>
          <td> 5501.487917</td>
          <td> 4143.371514</td>
        </tr>
      </tbody>
    </table>
    
    To help interpret these results, here's the corresponding table that tracks 
the number of disk bytes spilled:
    
    <table border="1" class="dataframe">
      <thead>
        <tr>
          <th></th>
          <th colspan="4" halign="left">mean</th>
        </tr>
        <tr>
          <th></th>
          <th colspan="4" halign="left">diskBytesSpilled</th>
        </tr>
        <tr>
          <th>numKeys</th>
          <th>10000000</th>
          <th>20000000</th>
          <th>30000000</th>
          <th>40000000</th>
        </tr>
        <tr>
          <th>mode</th>
          <th></th>
          <th></th>
          <th></th>
          <th></th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th>sort_serialize</th>
          <td> 0</td>
          <td> 113607985</td>
          <td> 113607985</td>
          <td> 229053192</td>
        </tr>
        <tr>
          <th>tungsten-sort</th>
          <td> 0</td>
          <td> 113599278</td>
          <td> 113599284</td>
          <td> 229043276</td>
        </tr>
      </tbody>
    </table>
    
    As I would expect, both shuffle implementations perform similarly when data 
fits in memory.  I think that `UnsafeShuffleManager`'s optimized shuffle merge 
path explains the large performance gains for shuffles that trigger disk 
spilling.  Most of the variability here probably comes from my laptop's SSD.
    
    I plan to run more comprehensive benchmarks on EC2, where we'll be able to 
see the benefits of using multiple disks and larger JVM heaps.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to