Split Combining drops splits with empty getLocations()
------------------------------------------------------

                 Key: PIG-2647
                 URL: https://issues.apache.org/jira/browse/PIG-2647
             Project: Pig
          Issue Type: Bug
          Components: impl
            Reporter: Alex Levenson


in:
org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil#getCombinePigSplits
which is used by PigInputFormat

There is an assumption that every split's getLocations() will return a 
non-empty array.
If the following criteria are met:
1) Split combining is turned on
2) There is more than one split
3) There is at least one split that is smaller than the maxCombineSplitSize

splits with empty getLocations() will simply be dropped (ignored) without 
warning.

The hadoop API does not specify that all splits must return a location and 
there are cases where a split may want to return no locations (if the data is 
not in HDFS for example, or if the data is a directory full of HDFS files in 
which case there's not much gained by having locality)

This is due to the implementation of 
org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil#getCombinePigSplits
scans all splits eligible for combining and creates a map of Nodes -> splits, 
then laster iterates through the MAP (not the splits) to do the combining.

One solution would be to inject a dummy "empty node" into the map.

Overall the logic in getCombinePigSplits is very complicated and has a lot of 
edge cases, it might be worth cleaning up.


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Reply via email to