Github user kayousterhout commented on the pull request:

    https://github.com/apache/spark/pull/6652#issuecomment-110121649
  
    @shivaram and I discussed this offline, in particular whether it was 
possible to incorporate @sryza's point that this only makes sense when there's 
a lot of skew in the placement of the data read by one reduce task.  We came up 
with a simple approach that seems strictly better than the current one: it's 
simpler to implement, and it also avoids setting preferred locations when there 
isn't much skew in the output.  The new idea is to set the preferred locations 
for a reducer to be any locations that hold >20% of the data to be read by that 
reducer.  This results in at most 5 locations being set (the current approach 
always chooses 5 locations), but typically fewer.  20% is totally a magic 
number here... but intuitively it seems unnecessary to set a preferred location, 
and add that complexity, when there would be less than a 20% performance 
improvement from doing so.
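    
    To make the selection rule concrete, here is a small sketch (in Python, 
not the actual Spark implementation; the function name and the map-of-bytes 
input are hypothetical) of picking every location that holds more than 20% of 
a reducer's input.  Since the fractions sum to 1, at most floor(1/0.2) = 5 
locations can clear the threshold, which is where the "at most 5" bound comes 
from:
    
    ```python
    def preferred_locations(bytes_by_location, threshold=0.2):
        """Return the hosts holding more than `threshold` of a reducer's input.

        `bytes_by_location` maps host -> bytes of shuffle output that this
        reducer will read from that host.  Because the per-host fractions sum
        to 1, at most floor(1/threshold) hosts can exceed the threshold, so
        with threshold=0.2 at most 5 preferred locations are returned.
        """
        total = sum(bytes_by_location.values())
        if total == 0:
            return []
        return [host for host, size in bytes_by_location.items()
                if size / total > threshold]

    # One skewed host and one moderate host qualify; the small one does not.
    preferred_locations({"hostA": 70, "hostB": 25, "hostC": 5})
    # With a perfectly even spread across many hosts, no host clears 20%,
    # so no preferred locations are set at all.
    preferred_locations({f"host{i}": 10 for i in range(10)})
    ```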
    
    Unless anyone objects, @shivaram is going to push a new version that uses 
that approach.

