[ https://issues.apache.org/jira/browse/PIG-97?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alan Gates updated PIG-97:
--------------------------
Patch Info: [Patch Available]
> Jobs produce wrong results when a cogroup is in the script and the compiler
> chooses to use the combiner feature of hadoop.
> --------------------------------------------------------------------------------------------------------------------------
>
> Key: PIG-97
> URL: https://issues.apache.org/jira/browse/PIG-97
> Project: Pig
> Issue Type: Bug
> Components: impl
> Affects Versions: 0.0.0
> Reporter: Alan Gates
> Assignee: Alan Gates
> Attachments: cogroupcombiner.patch
>
>
> The following script produces no output records, even when the inputs
> should yield records:
> a = load 'file1';
> b = load 'file2';
> c = cogroup a by $0, b by $0;
> d = foreach c generate $0, COUNT($1), COUNT($2);
> dump d;
> In this case Pig chooses to use the combiner in order to be more efficient.
> However, the following code in PigCombiner.java causes a problem:
> for (int i = 0; i < inputCount; i++) {
>     // XXX: shouldn't we only do this if INNER flag is set?
>     if (t.getBagField(1 + i).size() == 0) return;
> }
> In this case a map often runs on a machine where it has access to only one
> of the two files, so one of the bags is empty. The loop above then makes the
> combiner bail out without pushing any tuples to the OutputCollector.
> The proposed solution for the short term is to disable use of the combiner in
> cases where more than one file is grouped together.
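The early return described above can be reproduced outside Pig with a small standalone sketch. Everything in it is hypothetical and for illustration only: the class CombinerSketch, its methods, and the use of plain lists in place of Pig's tuples and bags are assumptions, not code from PigCombiner.java or from cogroupcombiner.patch. The second method only illustrates the idea raised in the XXX comment (skip the check unless an input is marked INNER); it is not the fix proposed in this issue, which is to disable the combiner.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for the combiner logic; plain lists replace Pig bags.
public class CombinerSketch {

    // Mirrors the reported check: emit nothing if any input's bag is empty.
    static List<String> combineAsReported(String key, List<List<String>> bags) {
        List<String> out = new ArrayList<>();
        for (List<String> bag : bags) {
            if (bag.isEmpty()) {
                return out; // bails out, so this key never reaches the output
            }
        }
        out.add(format(key, bags));
        return out;
    }

    // Assumed variant: bail out only when an input flagged as INNER is empty.
    static List<String> combineInnerOnly(String key, List<List<String>> bags,
                                         boolean[] inner) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < bags.size(); i++) {
            if (inner[i] && bags.get(i).isEmpty()) {
                return out;
            }
        }
        out.add(format(key, bags));
        return out;
    }

    // Renders the key followed by the size of each bag, tab separated.
    static String format(String key, List<List<String>> bags) {
        StringBuilder sb = new StringBuilder(key);
        for (List<String> bag : bags) {
            sb.append('\t').append(bag.size());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A map that saw records from the first file only: second bag is empty.
        List<List<String>> bags = Arrays.asList(
                Arrays.asList("r1", "r2"),
                Collections.<String>emptyList());

        System.out.println(combineAsReported("k", bags));
        System.out.println(combineInnerOnly("k", bags, new boolean[] {false, false}));
    }
}
```

With one empty bag, combineAsReported returns an empty list, which matches the symptom of 0 output records, while the variant still emits a record for the key because neither input is marked INNER.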
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.