Oliver Draese created TEZ-4069:
----------------------------------

             Summary: Avoid repeated computation of preferred locations in split grouping.
                 Key: TEZ-4069
                 URL: https://issues.apache.org/jira/browse/TEZ-4069
             Project: Apache Tez
          Issue Type: Improvement
    Affects Versions: 0.9.2
            Reporter: Oliver Draese


While grouping splits (see getGroupedSplits), the TezSplitGrouper iterates 
through the list of splits multiple times. On each pass, it asks the 
locationProvider for the array of preferred locations of every split. 
This has two side effects:
 * generating the list of preferred locations costs CPU time (e.g. 
calculating the consistent hash in HostAffinitySplitLocationProvider), which 
can be avoided
 * if the list of preferred locations changes between the different loops of 
getGroupedSplits, we might encounter a NullPointerException. This happens if a 
new location appears that was not part of the initial set of locations when 
the distinctLocations map was populated.
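
The second point can be illustrated with a small, self-contained sketch. It 
is not the TezSplitGrouper code: the class UnstableLocationsDemo, its methods, 
and the host names are hypothetical, and only the name distinctLocations is 
taken from the description above. The first loop registers every location it 
sees, the second loop asks the provider again and looks the answer up in the 
map.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in for the grouping loops, not the actual Tez code.
class UnstableLocationsDemo {

    // Returns true if the second pass sees a location that the first pass
    // never registered. Code that dereferences the lookup result without a
    // null check would throw a NullPointerException at that point.
    static boolean secondPassHitsUnknownLocation(
            List<String> splits, Function<String, String[]> provider) {
        Map<String, List<String>> distinctLocations = new HashMap<>();

        // Pass 1: populate the map with every location reported right now.
        for (String split : splits) {
            for (String location : provider.apply(split)) {
                distinctLocations.computeIfAbsent(location, k -> new ArrayList<>());
            }
        }

        // Pass 2: ask the provider again and look each location up.
        for (String split : splits) {
            for (String location : provider.apply(split)) {
                List<String> holder = distinctLocations.get(location);
                if (holder == null) {
                    return true; // location appeared after pass 1
                }
                holder.add(split);
            }
        }
        return false;
    }

    // Assumed provider whose answer changes between calls:
    // "hostA" on the first call, "hostB" on every later call.
    static Function<String, String[]> changingProvider() {
        int[] calls = {0};
        return split -> new String[] {calls[0]++ == 0 ? "hostA" : "hostB"};
    }

    public static void main(String[] args) {
        List<String> splits = Arrays.asList("split-0");
        System.out.println(
            secondPassHitsUnknownLocation(splits, changingProvider())); // prints "true"
    }
}
```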

getGroupedSplits should query the location provider for the preferred 
locations only once per split and then cache the result instead of asking the 
location provider repeatedly.
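
A minimal sketch of that idea, under stated assumptions: the class 
CachedLocations and the method resolveOnce are hypothetical names, the 
provider is modeled as a plain java.util.function.Function rather than the 
Tez location provider interface, and a null answer is treated as "no 
preferred location". The sketch is only meant to show the caching pattern, 
not the actual patch.

```java
import java.util.Arrays;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper, not part of Tez.
class CachedLocations {

    // Asks the provider exactly once per split and remembers the answer.
    // An IdentityHashMap is used so that splits are keyed by object identity
    // and do not need meaningful equals/hashCode implementations.
    static <S> Map<S, String[]> resolveOnce(
            List<S> splits, Function<S, String[]> provider) {
        Map<S, String[]> cache = new IdentityHashMap<>();
        for (S split : splits) {
            if (!cache.containsKey(split)) {
                String[] locations = provider.apply(split);
                // Assumption: null means "no preferred location".
                cache.put(split, locations == null ? new String[0] : locations);
            }
        }
        return cache;
    }

    public static void main(String[] args) {
        List<String> splits = Arrays.asList("split-0", "split-1");
        int[] providerCalls = {0};

        Map<String, String[]> locations = resolveOnce(splits, split -> {
            providerCalls[0]++;
            return new String[] {"host-for-" + split};
        });

        // Later passes read from the cache and never touch the provider.
        for (int pass = 0; pass < 3; pass++) {
            for (String split : splits) {
                String[] cached = locations.get(split);
            }
        }
        System.out.println(providerCalls[0]); // prints "2", one call per split
    }
}
```

Because every loop reads the same cached arrays, a location that shows up 
only in a later provider answer can no longer reach the lookup into 
distinctLocations, and the hashing cost is paid once per split.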


--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
