Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/1499#issuecomment-49565532
  
    I've now updated this to support partial aggregation across spilled files, 
even when we don't have an Ordering, by using hash code comparisons similar to 
ExternalAppendOnlyMap. It also now fully implements the behavior described in 
the docs, namely sorting the data if you pass an Ordering, etc.
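
    As a minimal sketch of the idea (not the code in this PR, and with 
hypothetical names), merging spilled runs without an Ordering can be done by 
sorting records on each key's hash code and then resolving hash collisions 
with an equality check while combining:

```scala
import scala.collection.mutable

object HashMergeSketch {
  // Orders keys only by hash code; colliding keys compare as equal here and
  // must be told apart with equals() during the merge.
  def hashComparator[K]: Ordering[K] = new Ordering[K] {
    def compare(a: K, b: K): Int =
      java.lang.Integer.compare(if (a == null) 0 else a.hashCode(),
                                if (b == null) 0 else b.hashCode())
  }

  // Merge several hash-sorted runs (e.g. spilled files read back as iterators),
  // combining values only for keys that are truly equal, not just hash-equal.
  def mergeCombine[K, C](runs: Seq[Iterator[(K, C)]],
                         mergeCombiners: (C, C) => C): Iterator[(K, C)] = {
    val cmp = hashComparator[K]
    // Collected and re-sorted here for brevity; real code would stream the runs.
    val all = runs.flatten.sortBy(_._1)(cmp)
    val out = mutable.ArrayBuffer[(K, C)]()
    for ((k, c) <- all) {
      // Look back over already-emitted entries that share this key's hash code.
      var i = out.length - 1
      var merged = false
      while (i >= 0 && cmp.compare(out(i)._1, k) == 0 && !merged) {
        if (out(i)._1 == k) {
          out(i) = (k, mergeCombiners(out(i)._2, c))
          merged = true
        }
        i -= 1
      }
      if (!merged) out += ((k, c))
    }
    out.iterator
  }
}
```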
    
    It looks like Aaron found a problem with the size-tracking code -- I'll try 
to fix that in ExternalAppendOnlyMap as well. Once this class is in, it could 
replace ExternalAppendOnlyMap in most use cases. Its one downside is that it 
creates an extra object for each key in the in-memory collection (since we 
store `((Int, K), C)` pairs to allow sorting by partition), but it might still 
be worth it long-term. On the flip side, I think the hash-based merging code 
here is more efficient, since it avoids a bunch of `ArrayBuffer.remove` calls.
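
    As a rough illustration of that layout (hypothetical names, not this PR's 
code), the in-memory buffer holds `((Int, K), C)` records and can be sorted by 
partition id first, then by the key's hash code when no Ordering is given:

```scala
object PartitionSortSketch {
  // Orders ((partition, key), combiner) records by partition id, then key hash.
  def partitionKeyComparator[K, C]: Ordering[((Int, K), C)] =
    new Ordering[((Int, K), C)] {
      def compare(a: ((Int, K), C), b: ((Int, K), C)): Int = {
        val byPartition = java.lang.Integer.compare(a._1._1, b._1._1)
        if (byPartition != 0) byPartition
        else java.lang.Integer.compare(a._1._2.hashCode(), b._1._2.hashCode())
      }
    }

  def main(args: Array[String]): Unit = {
    // Each record carries its partition id alongside the key -- the extra
    // tuple per record is the overhead mentioned above.
    val buffer = Array(((1, "b"), 2), ((0, "a"), 1), ((1, "a"), 3), ((0, "c"), 4))
    buffer.sorted(partitionKeyComparator[String, Int]).foreach(println)
  }
}
```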

