[
https://issues.apache.org/jira/browse/PIG-1757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yan Zhou updated PIG-1757:
--------------------------
Attachment: PIG-1757.patch
test-core runs ok; test-patch is clean except for lack of test case which is ok
for this trivial change and difficulty to run on a local cluster.
> After split combination, the number of maps may vary slightly
> -------------------------------------------------------------
>
> Key: PIG-1757
> URL: https://issues.apache.org/jira/browse/PIG-1757
> Project: Pig
> Issue Type: Improvement
> Affects Versions: 0.8.0
> Reporter: Yan Zhou
> Assignee: Yan Zhou
> Priority: Minor
> Fix For: 0.9.0
>
> Attachments: PIG-1757.patch
>
>
> The split combination, introduced in 0.8 by PIG-1518, may see small
> variations in number of maps. For instance, PigMix2's L4 query experiences a
> variation of 901 or 902 maps in a test cluster. The reason is that the
> BlockLocation's getHosts
> method, used in FileInputFormat's spli generation, returns a list of hosts
> that hold the block. However the ordering of the list is not deterministic.
> Pig's split combination is not immune to such a random ordering since the
> combination decision is based upon the hosts that hold as many data local to
> a map as possible, and there is no specific tie-breaking rule to force a
> particular ordering. In some benchmarking or performance baselining tests,
> these variations, however small they are, might not be desirable.
> One solution is to sort the host lists from the component splits so as to get
> consistent number of maps.
> I suspect that other split combination techniques that make use of the data
> host info to maximize the data locality in each map, like
> CombineFileInputFormat, might have had the similar variations of number of
> maps.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.