[ https://issues.apache.org/jira/browse/TEZ-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17116430#comment-17116430 ]
Ashutosh Chauhan commented on TEZ-4069:
---------------------------------------

+1

> Avoid repeated computation of preferred locations in split grouping.
> --------------------------------------------------------------------
>
>                 Key: TEZ-4069
>                 URL: https://issues.apache.org/jira/browse/TEZ-4069
>             Project: Apache Tez
>          Issue Type: Improvement
>    Affects Versions: 0.9.2
>            Reporter: Oliver Draese
>            Priority: Major
>         Attachments: TEZ-4069.1.patch, TEZ-4069.patch
>
>
> The TezSplitGrouper iterates through the list of splits multiple times when
> trying to group the splits (see getGroupedSplits). Each time, it asks the
> locationProvider to return the array of preferred locations for the splits.
> This has two side effects:
> * generating the list of preferred locations can cause some CPU overhead
> (e.g. calculating the consistent hash in HostAffinitySplitLocationProvider),
> which can be avoided
> * if the list of preferred locations changes between the different loops
> of getGroupedSplits, we might encounter a NullPointerException. This happens
> if a new location appears that was not part of the initial set of locations
> when populating the distinctLocations map.
> The getGroupedSplits method should query the preferred locations only once
> (for each split) via the location provider and then remember them instead of
> asking the location provider repeatedly.
>
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
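
For readers following the issue, the proposed fix (query each split's preferred locations once, then reuse the result in every later pass) can be sketched roughly as below. This is only an illustration and not the attached patch: the class LocationMemoSketch, the LocationProvider interface, and all method names are hypothetical stand-ins, and the two passes are simplified compared to what getGroupedSplits actually does.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

public class LocationMemoSketch {

    /** Hypothetical stand-in for a split location provider; not the real Tez interface. */
    public interface LocationProvider<S> {
        String[] getPreferredLocations(S split);
    }

    /** Asks the provider once per distinct split object and remembers the answer. */
    public static <S> Map<S, String[]> memoizeLocations(List<S> splits,
                                                        LocationProvider<S> provider) {
        Map<S, String[]> cached = new IdentityHashMap<>();
        for (S split : splits) {
            if (!cached.containsKey(split)) {
                String[] locations = provider.getPreferredLocations(split);
                // Normalize a null answer to an empty array so later passes can iterate safely.
                cached.put(split, locations == null ? new String[0] : locations);
            }
        }
        return cached;
    }

    /** Pass 1: count how many splits prefer each location, using only cached arrays. */
    public static <S> Map<String, Integer> countPerLocation(List<S> splits,
                                                            Map<S, String[]> cached) {
        Map<String, Integer> counts = new HashMap<>();
        for (S split : splits) {
            for (String location : cached.get(split)) {
                counts.merge(location, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Pass 2: assign each split to its preferred location with the highest count. */
    public static <S> Map<String, List<S>> groupByBusiestLocation(List<S> splits,
                                                                  Map<S, String[]> cached,
                                                                  Map<String, Integer> counts) {
        Map<String, List<S>> groups = new HashMap<>();
        for (S split : splits) {
            String best = null;
            int bestCount = 0;
            for (String location : cached.get(split)) {
                // Unboxing is safe here: pass 1 saw exactly the same cached arrays,
                // so every location is already a key in counts.
                int count = counts.get(location);
                if (count > bestCount) {
                    best = location;
                    bestCount = count;
                }
            }
            if (best != null) {
                groups.computeIfAbsent(best, k -> new ArrayList<>()).add(split);
            }
        }
        return groups;
    }
}
```

Because both passes read from the same cached arrays, a provider whose answers change between calls can no longer introduce a location in the second pass that the first pass never counted, which is the NullPointerException scenario described in the issue. The provider is also invoked once per split rather than once per pass.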