[ https://issues.apache.org/jira/browse/PIG-97?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-97:
--------------------------

    Attachment: cogroupcombiner.patch

The attached patch disables the combiner when a cogroup is used.  It also
restores the POVisitor to work the way it did before the front-end/back-end
split (PIG-32) was introduced.  I needed this to make explain work again, so
that I could see when the combiner was and wasn't being invoked.

Antonio, please take a look at my changes to the POVisitor and make sure they 
will work within the new split framework.

> Jobs produce wrong results when a cogroup is in the script and the compiler 
> chooses to use the combiner feature of hadoop.
> --------------------------------------------------------------------------------------------------------------------------
>
>                 Key: PIG-97
>                 URL: https://issues.apache.org/jira/browse/PIG-97
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.0.0
>            Reporter: Alan Gates
>            Assignee: Alan Gates
>         Attachments: cogroupcombiner.patch
>
>
> The following script will produce 0 output records, even when it should 
> produce records:
> a = load 'file1';
> b = load 'file2';
> c = cogroup a by $0, b by $0;
> d = foreach c generate $0, COUNT($1), COUNT($2);
> dump d;
> In this case Pig chooses to use the combiner in order to be more efficient.
> However, the following code in PigCombiner.java causes a problem:
> // XXX: shouldn't we only do this if INNER flag is set?
> for (int i = 0; i < inputCount; i++) {
>     if (t.getBagField(1 + i).size() == 0) return;
> }
> A map often runs on a machine where it has access to only one of the two
> files, so one of the bags is empty.  The check above then causes the combiner
> to bail out without pushing any tuples to the OutputCollector.
> The proposed short-term solution is to disable the combiner whenever more
> than one file is grouped together.
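
To make the failure mode described above concrete, here is a minimal,
self-contained Java sketch.  It is not the PigCombiner code and not the
attached patch: the class CogroupCombinerSketch, all of its methods, and the
list-of-lists stand-in for bags are hypothetical names chosen for
illustration.  It assumes a map task saw records from only one of the two
cogrouped inputs, and shows that an early return on an empty bag emits
nothing, while a guard on the number of inputs would avoid the combiner
altogether.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: none of these names come from the Pig code base.
public class CogroupCombinerSketch {

    // Builds one bag (a list of records) per cogroup input.
    static List<List<String>> bags(String[]... inputs) {
        List<List<String>> result = new ArrayList<>();
        for (String[] input : inputs) {
            result.add(Arrays.asList(input));
        }
        return result;
    }

    // Mirrors the quoted check: if any bag is empty, return without emitting.
    static List<long[]> combineWithEarlyReturn(List<List<String>> bags) {
        List<long[]> emitted = new ArrayList<>();
        for (List<String> bag : bags) {
            if (bag.isEmpty()) {
                return emitted; // nothing was added, so this key's records are lost
            }
        }
        emitted.add(partialCounts(bags));
        return emitted;
    }

    // Per-input partial counts, as a combiner for COUNT might produce them.
    static long[] partialCounts(List<List<String>> bags) {
        long[] counts = new long[bags.size()];
        for (int i = 0; i < bags.size(); i++) {
            counts[i] = bags.get(i).size();
        }
        return counts;
    }

    // Assumed shape of the short-term workaround: combiner only for one input.
    static boolean canUseCombiner(int inputCount) {
        return inputCount <= 1;
    }

    public static void main(String[] args) {
        // A map task that saw two records from 'file1' and none from 'file2'.
        List<List<String>> mapOutput =
                bags(new String[] {"r1", "r2"}, new String[] {});
        System.out.println("records emitted by early-return combiner: "
                + combineWithEarlyReturn(mapOutput).size());
        System.out.println("partial counts without the check: "
                + Arrays.toString(partialCounts(mapOutput)));
        System.out.println("combiner allowed for 2 inputs: " + canUseCombiner(2));
    }
}
```

The guard is deliberately blunt: it gives up the combiner's efficiency for
every cogroup of two or more inputs rather than trying to decide when the
empty-bag check is safe, which matches the short-term nature of the proposal.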

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
