Github user JoshRosen commented on the pull request:
https://github.com/apache/spark/pull/5868#issuecomment-100720672
Unfortunately, it looks like Snappy and LZ4 don't support concatenation of
compressed streams, which means that my nice zero-copy IO tricks for fast
merging of spills won't work unless LZF is used as the shuffle compression
codec (see f780fb1c19498246c1de3a86e8e7816359bf4069 for some test cases). I
don't think that we can specify compression codecs on a per-shuffle-block
basis, so I think we're stuck obeying the user's choice of compression codec
unless we're willing to refactor the read path to accommodate multiple
decompression codecs, etc. (I think that fix is too risky for 1.4, since it
involves modifying existing code, whereas all of the other changes up to this
point have been purely additive and fully feature-flagged).
The actual decompression of shuffle blocks takes place at a fairly low level
inside the block transfer / storage layers, so it's hard to override in a
custom ShuffleReader. I might just have to abandon the idea of an accelerated
merge procedure, or feature-flag it so that it only applies when LZF is used.
That's a shame, since it really speeds up the merge.