Github user mateiz commented on the pull request:
https://github.com/apache/spark/pull/931#issuecomment-49254607
Hey, so I rebased this PR and made it mergeable in my own branch,
https://github.com/mateiz/spark/tree/spark-931. However, in doing this I
realized that there may be some fundamental problems here.
The main issue is that AppendOnlyMap and ExternalAppendOnlyMap both require
that there be only one value per key: the in-memory AOM becomes very
inefficient otherwise, and the EAOM depends on that behavior. This means that
for sort, we have to group values into (Key, ArrayBuffer[Value]) pairs, which
will consume more memory by default than our in-memory sort, and will make us
crash if a single key has too many values (a failure mode we already have
today, but one that would hit sooner here). Thus it seems that long-term we
need a very different solution here, basically an external merge-sort.
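
To make this concrete, here's a quick standalone sketch of the grouping that
the one-value-per-key constraint forces (plain Scala with a mutable HashMap,
not the actual AppendOnlyMap code; the names are just illustrative). Every
value for a key gets folded into a single per-key buffer, so one heavily
repeated key accumulates all of its values in memory at once:

```scala
import scala.collection.mutable.{ArrayBuffer, HashMap}

object GroupIntoBuffers {
  def group[K, V](records: Iterator[(K, V)]): HashMap[K, ArrayBuffer[V]] = {
    val map = new HashMap[K, ArrayBuffer[V]]
    for ((k, v) <- records) {
      // Keep exactly one entry per key, as AppendOnlyMap requires, at the
      // cost of one growing ArrayBuffer per distinct key.
      map.getOrElseUpdate(k, new ArrayBuffer[V]) += v
    }
    map
  }

  def main(args: Array[String]): Unit = {
    // A skewed key: all of its values pile into a single in-memory buffer,
    // which is what can OOM us when one key has too many values.
    val records = Iterator.fill(5)(("hot", 1)) ++ Iterator(("cold", 2))
    group(records).foreach { case (k, vs) => println(s"$k -> $vs") }
  }
}
```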
A second, possibly less serious issue is that changing the EAOM to use
comparator.compare instead of hash-code equality makes it less efficient in
the default hashing-based case: instead of saving each key's hash code in an
Int and reusing it to compare against other keys in various places, we now
recompute the hash every time we compare a pair of elements.
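
Here's a hypothetical illustration of that cost (standalone Scala, not the
EAOM internals): with hashing, each key's hash is computed once and cached,
and comparisons short-circuit on a cheap Int compare; an ordering defined
over hash codes has to recompute hashCode() on every compare call.

```scala
final case class CachedKey[K](key: K) {
  val cachedHash: Int = key.hashCode()   // computed once, reused everywhere
}

object HashCaching {
  // Hash-based path: compare cached Ints first; only call equals on a match.
  def fastEqual[K](a: CachedKey[K], b: CachedKey[K]): Boolean =
    a.cachedHash == b.cachedHash && a.key == b.key

  // Comparator-based path: hashCode() runs again on every comparison.
  def hashOrdering[K]: Ordering[K] =
    Ordering.by((k: K) => k.hashCode())

  def main(args: Array[String]): Unit = {
    val (a, b) = (CachedKey("spark"), CachedKey("shuffle"))
    println(fastEqual(a, b))                                   // one Int compare
    println(hashOrdering[String].compare("spark", "shuffle"))  // recomputes hashes
  }
}
```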
For these reasons I'd actually hold off on merging this (even my merged
version) until we implement an external merge-sort as part of sort-based
shuffle. Then we can use that data structure here.
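
For reference, here's a rough sketch of the kind of external merge-sort I
mean, under purely illustrative assumptions (Int records, temp files, a fixed
in-memory chunk size; this is not the eventual Spark implementation): sort
chunks in memory, spill each sorted run to disk, then merge the runs with a
priority queue, so duplicate keys stream through the merge instead of piling
up in one buffer.

```scala
import java.io._
import scala.collection.mutable

object ExternalMergeSort {
  def sort(input: Iterator[Int], maxInMemory: Int): Iterator[Int] = {
    // Phase 1: sort fixed-size chunks in memory and spill each run to a file.
    val runs = input.grouped(maxInMemory).map { chunk =>
      val file = File.createTempFile("spill-", ".run")
      file.deleteOnExit()
      val out = new DataOutputStream(
        new BufferedOutputStream(new FileOutputStream(file)))
      chunk.sorted.foreach(out.writeInt)
      out.close()
      file
    }.toSeq

    // Phase 2: stream the runs back and merge them with a min-heap that
    // holds one head element per run, never the full data set.
    val readers = runs.map(f =>
      new DataInputStream(new BufferedInputStream(new FileInputStream(f))))
    def read(r: DataInputStream): Option[Int] =
      try Some(r.readInt()) catch { case _: EOFException => r.close(); None }

    val heap = mutable.PriorityQueue.empty[(Int, Int)](
      Ordering.by[(Int, Int), Int](_._1).reverse)
    readers.zipWithIndex.foreach { case (r, i) =>
      read(r).foreach(v => heap.enqueue((v, i)))
    }

    new Iterator[Int] {
      def hasNext: Boolean = heap.nonEmpty
      def next(): Int = {
        val (v, i) = heap.dequeue()
        // Refill the heap from whichever run we just consumed from.
        read(readers(i)).foreach(nv => heap.enqueue((nv, i)))
        v
      }
    }
  }

  def main(args: Array[String]): Unit =
    println(sort(Iterator(5, 3, 3, 9, 1, 3, 7, 2), maxInMemory = 3).toList)
}
```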