Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/931#issuecomment-46351611
  
    @xiajunluan I took a pass over the logic and it looks correct to me. I like 
your idea of not relying on the hash code anymore; this has caused us a few 
correctness problems in the past.
    
    In response to your second point about spilling within a key for sort, we 
have two alternatives: (1) add special handling for sorting in this PR as you 
suggest, or (2) add general handling to ExternalAppendOnlyMap. I think we 
should go with the latter, because shuffles in general also suffer for the same 
problem if we have too many values for one key. Once we fix that, now that 
sortByKey goes through this map, sorting will get that property for free as 
well. Also, deferring that change will keep this PR simple.
    
    As @mateiz mentioned, it would be good if you test the performance of a 
simple shuffle and/or join before and after your changes. The logic in 
`ExternalIterator` is actually quite sensitive because it is a super hot code 
path, so we should make sure it doesn't cause a regression.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

Reply via email to