Github user JoshRosen commented on the pull request:
https://github.com/apache/spark/pull/5868#issuecomment-100720672
Unfortunately, it looks like Snappy and LZ4 don't support concatenation of
compressed streams, which means that my nice zero-copy IO tricks for fast
merging of spills won't work unless LZF is used as the shuffle compression
codec (see f780fb1c19498246c1de3a86e8e7816359bf4069 for some test cases). I
don't think that we can specify compression codecs on a per-shuffle-block
basis, so I think we're stuck obeying the user's choice of compression codec
unless we're willing to refactor the read path to accommodate multiple
decompression codecs, etc. (I think that fix is too risky for 1.4, since it
involves modifying existing code, whereas all of the other changes up to this
point have been purely additive and fully feature-flagged).
The actual decompression of shuffle blocks takes place at a fairly low level
inside the block transfer / storage layers, so it's hard to override in a
custom ShuffleReader. I might just have to abandon the idea of an accelerated
merge procedure, or feature-flag it so that it only applies when LZF is used.
That's a shame, since it really speeds up the merge.