GitHub user kayousterhout commented on the pull request:
https://github.com/apache/spark/pull/6652#issuecomment-110121649
@shivaram and I discussed this offline, including whether it was possible to
incorporate @sryza's point that this only makes sense when there's a lot of
skew in the placement of the data read by one reduce task. We came up with a
simple approach that seems strictly better than the current one: it's simpler
to implement and also avoids setting preferred locations when there isn't much
skew in the output. The new idea is to set the preferred locations for a
reducer to be any locations that hold at least 20% of the data to be read by
that reducer. This results in at most 5 locations being set (the current
approach always chooses 5 locations), but typically fewer. 20% is totally a
magic number here, but intuitively it seems unnecessary to set a preferred
location, and add that complexity, when there will be less than a 20%
performance improvement from doing so.
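As a rough sketch of what this heuristic might look like (not the actual
patch; the names `ReducerLocalitySketch`, `preferredLocations`, `bytesByHost`,
and `FRACTION_THRESHOLD` are all illustrative):

```scala
// Minimal sketch of the proposed heuristic: given the number of bytes of map
// output each host holds for one reducer, return the hosts holding at least
// 20% of that reducer's total input.
object ReducerLocalitySketch {
  // The "magic number" threshold from the discussion above.
  val FRACTION_THRESHOLD = 0.2

  def preferredLocations(bytesByHost: Map[String, Long]): Seq[String] = {
    val total = bytesByHost.values.sum.toDouble
    if (total <= 0) {
      Seq.empty
    } else {
      // At most 5 hosts can each hold >= 20% of the total, so this returns
      // at most 5 locations, and typically fewer when the data is skewed.
      bytesByHost.collect {
        case (host, bytes) if bytes / total >= FRACTION_THRESHOLD => host
      }.toSeq
    }
  }
}
```

For example, a reducer reading 70% of its data from host A, 25% from host B,
and 5% from host C would get only A and B as preferred locations, while a
reducer whose input is spread roughly uniformly across many hosts would get
none.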
Unless anyone objects, @shivaram is going to push a new version that uses
that approach.