Github user JoshRosen commented on the pull request:
https://github.com/apache/spark/pull/5868#issuecomment-101050094
For folks following this patch, I think that this is very close to being
ready to merge. The size of the patch has grown a bit, but the majority of the
additions are comments and test code. I just updated `UnsafeShuffleManager`
with a large [block
comment](https://github.com/JoshRosen/spark/blob/unsafe-sort/core/src/main/scala/org/apache/spark/shuffle/unsafe/UnsafeShuffleManager.scala#L64)
summarizing the current design. I've also significantly increased the test
coverage and added tests for as many different corner-cases as I could think
of, including all combinations of shuffle compression codecs and spill merging
routines, as well as tests for inserting excessively large records and tests to
ensure that spill files are always cleaned up after errors. These tests rely
heavily on Mockito to mock out several Spark components, which is much nicer
than trying to test these paths using SparkContext or by directly instantiating
other components.
If anyone has spare bug-hunting cycles, I'd love to know if there are any
corner cases that I've overlooked and should test. A fruitful source of
potential bugs might be `ExternalSorterSuite`, since any bugs that were found
and fixed there could have potentially resurfaced in this new implementation.
I'll take a look through this list myself to see if I can find any good
candidates.
I'm going to look into fixing the shuffle metrics integration. This is a
little bit tricky because of some of the code re-use between the spilling and
non-spilling write paths, but I think that it should be manageable if I write
some good test cases.
I've been doing some local performance profiling and have observed some
large speedups, especially for workloads that write many spill files (in one
case, I saw a 2x speedup vs. https://github.com/apache/spark/pull/4450, which
itself is already much faster than 1.3's sort shuffle). After I fix the
metrics integration, I'll re-run my tests and post some perf. results here.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]