Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/5868#issuecomment-98893862
  
    I think we are conflating a few different things. These are my thoughts 
after looking at #3438. There are two scenarios:
    
    1. aggregation: I can't see any performance gains for aggregations based on 
the provided benchmark except in one case. In some cases it becomes slower, but 
that's probably noise. The one case where #3438 wins over the existing approach 
is when GC happens a lot for the reduce-side hash table. This really just 
suggests that for high-cardinality aggregation, (external) sorting is better 
than hashing. This is really useful and we should have it. I'm not sure about 
having it in shuffle itself.
    
    2. sortByKey: I agree that we should just sort by key on the map side for 
global ordering. However, it is less clear whether doing explicit merging on 
the reduce side would be much better than just doing a full sort, since:
    
    (1) merging a large number of streams is very slow
    (2) merging a small number of streams ... TimSort already performs better 
on partially sorted data.
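
    To make the two comparisons above concrete, here is a minimal Python sketch.
It is not Spark code and is not taken from #3438; every function name is
hypothetical and chosen only for illustration. It shows hash-based versus
sort-based aggregation for scenario 1, and an explicit k-way merge versus a
full sort over concatenated sorted runs for scenario 2. The full sort relies on
CPython's built-in sort being a Timsort variant, which detects runs that are
already ordered.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def hash_aggregate(records):
    """Hash-based aggregation: keeps one table entry per distinct key."""
    totals = {}
    for key, value in records:
        totals[key] = totals.get(key, 0) + value
    # Sorted only so the result is comparable with sort_aggregate.
    return sorted(totals.items())


def sort_aggregate(records):
    """Sort-based aggregation: sort by key, then fold adjacent equal keys."""
    ordered = sorted(records, key=itemgetter(0))
    return [
        (key, sum(value for _, value in group))
        for key, group in groupby(ordered, key=itemgetter(0))
    ]


def reduce_by_merge(sorted_runs):
    """Explicit k-way merge of runs that were each sorted on the map side."""
    return list(heapq.merge(*sorted_runs))


def reduce_by_full_sort(sorted_runs):
    """Concatenate the runs and sort the whole thing in one pass."""
    combined = [item for run in sorted_runs for item in run]
    combined.sort()  # run-aware sort; benefits from the pre-sorted runs
    return combined
```

    Both members of each pair return the same result; they differ in memory
behaviour and in how the cost scales with the number of distinct keys or the
number of streams, which is the trade-off being discussed here. The sketch says
nothing about actual Spark performance.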
    
    


