After split combination, the number of maps may vary slightly
-------------------------------------------------------------
Key: PIG-1757
URL: https://issues.apache.org/jira/browse/PIG-1757
Project: Pig
Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Yan Zhou
Priority: Minor
Fix For: 0.9.0
The split combination, introduced in 0.8 by PIG-1518, may see small variations
in number of maps. For instance, PigMix2's L4 query experiences a variation of
901 or 902 maps in a test cluster. The reason is that the BlockLocation's
getHosts
method, used in FileInputFormat's spli generation, returns a list of hosts that
hold the block. However the ordering of the list is not deterministic. Pig's
split combination is not immune to such a random ordering since the combination
decision is based upon the hosts that hold as many data local to a map as
possible, and there is no specific tie-breaking rule to force a particular
ordering. In some benchmarking or performance baselining tests, these
variations, however small they are, might not be desirable.
One solution is to sort the host lists from the component splits so as to get
consistent number of maps.
I suspect that other split combination techniques that make use of the data
host info to maximize the data locality in each map, like
CombineFileInputFormat, might have had the similar variations of number of
maps.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.