[
https://issues.apache.org/jira/browse/TEZ-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16842796#comment-16842796
]
Jonathan Eagles commented on TEZ-4069:
--------------------------------------
[~odraese], I don't have enough knowledge about the
HostAffinitySplitLocationProvider to make a decision about this. Could you help
me understand which product uses it and how? In your experience, how much CPU
savings (versus memory cost) are we seeing? How many splits are involved?
> Avoid repeated computation of preferred locations in split grouping.
> --------------------------------------------------------------------
>
> Key: TEZ-4069
> URL: https://issues.apache.org/jira/browse/TEZ-4069
> Project: Apache Tez
> Issue Type: Improvement
> Affects Versions: 0.9.2
> Reporter: Oliver Draese
> Priority: Major
> Attachments: TEZ-4069.1.patch, TEZ-4069.patch
>
>
> The TezSplitGrouper iterates through the list of splits multiple times when
> trying to group the splits (see getGroupedSplits). Each time, it asks the
> locationProvider to return the array of preferred locations for the splits.
> This has two side effects:
> * Generating the list of preferred locations can cause avoidable CPU overhead
> (e.g. calculating the consistent hash in HostAffinitySplitLocationProvider).
> * If the list of preferred locations changes between the different loops
> of getGroupedSplits, we might encounter a NullPointerException. This happens
> if a new location appears that was not part of the initial set of locations
> when the distinctLocations map was populated.
> getGroupedSplits should query the preferred locations only once per split
> via the location provider and then cache the result instead of asking
> the location provider repeatedly.
>
>
>
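For illustration, the "query once, then reuse" idea from the description can be sketched as below. This is a minimal sketch, not the attached patch: the class name MemoizedLocations and the use of a plain java.util.function.Function in place of the Tez location provider are assumptions made only for this example.

```java
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/**
 * Hypothetical helper (not part of Tez): resolves each split's preferred
 * locations once, so every later grouping pass sees the same snapshot.
 */
final class MemoizedLocations<S> {
    // Keyed by object identity, so split classes need no equals/hashCode.
    private final Map<S, String[]> cache = new IdentityHashMap<>();

    MemoizedLocations(List<S> splits, Function<? super S, String[]> provider) {
        for (S split : splits) {
            // The provider is called at most once per split object.
            cache.computeIfAbsent(split, s -> {
                String[] locations = provider.apply(s);
                return locations == null ? new String[0] : locations.clone();
            });
        }
    }

    /** Returns the locations captured at construction time. */
    String[] get(S split) {
        String[] locations = cache.get(split);
        if (locations == null) {
            throw new IllegalArgumentException("split was not part of the initial list");
        }
        return locations;
    }
}
```

Because every pass reads from the same snapshot, a location that the provider would only start returning later cannot show up in a later loop, which is the situation the description associates with the NullPointerException. The cost is one cached array per split, which is the memory side of the trade-off asked about in the comment.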
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)