Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/5868#issuecomment-101050094
  
    For folks following this patch, I think that this is very close to being 
ready to merge.  The size of the patch has grown a bit, but the majority of the 
additions are comments and test code.  I just updated `UnsafeShuffleManager` 
with a large [block 
comment](https://github.com/JoshRosen/spark/blob/unsafe-sort/core/src/main/scala/org/apache/spark/shuffle/unsafe/UnsafeShuffleManager.scala#L64)
 summarizing the current design.  I've also significantly increased the test 
coverage and added tests for as many different corner-cases as I could think 
of, including all combinations of shuffle compression codecs and spill merging 
routines, as well as tests for inserting excessively large records and tests to 
ensure that spill files are always cleaned up after errors.  These tests rely 
heavily on Mockito to mock out several Spark components, which is much nicer 
than trying to test these paths using SparkContext or by directly instantiating 
other components.
    
    If anyone has spare bug-hunting cycles, I'd love to know if there are any 
corner cases that I've overlooked and should test.  A fruitful source of 
potential bugs might be `ExternalSorterSuite`, since any bugs that were found 
and fixed there could have potentially resurfaced in this new implementation.  
I'll take a look through this list myself to see if I can find any good 
candidates.
    
    I'm going to look into fixing the shuffle metrics integration.  This is a 
little bit tricky because of some of the code re-use between the spilling and 
non-spilling write paths, but I think that it should be manageable if I write 
some good test cases.
    
    I've been doing some local performance profiling and have observed some 
large speedups, especially for workloads that write many spill files (in one 
case, I saw a 2x speedup vs. https://github.com/apache/spark/pull/4450, which 
itself is already much faster than 1.3's sort shuffle).  After I fix the 
metrics integration, I'll re-run my tests and post some perf. results here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to