[ 
https://issues.apache.org/jira/browse/PIG-739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Bhat updated PIG-739:
---------------------------

    Attachment: filter_distinctbug.pig
                testdata

Attached the test data and the Pig script.

> Filter in foreach seems to drop records resulting in decreased count of 
> records
> -------------------------------------------------------------------------------
>
>                 Key: PIG-739
>                 URL: https://issues.apache.org/jira/browse/PIG-739
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.3.0
>            Reporter: Viraj Bhat
>             Fix For: 0.3.0
>
>         Attachments: filter_distinctbug.pig, testdata
>
>
> I have a Pig script that counts the number of distinct records remaining 
> after a filter; both the filter and the distinct are nested inside a foreach. 
> With this script, the count reported by alias TESTDATA_AGG_2 is 1.
> {code}
> TESTDATA = load 'testdata' using PigStorage() as (timestamp:chararray,
>     testid:chararray, userid:chararray, sessionid:chararray, value:long,
>     flag:int);
> TESTDATA_FILTERED = filter TESTDATA by (timestamp gte '1230800400000' and
>     timestamp lt '1230804000000' and value != 0);
> TESTDATA_GROUP = group TESTDATA_FILTERED by testid;
> TESTDATA_AGG = foreach TESTDATA_GROUP {
>                         A = filter TESTDATA_FILTERED by (userid eq sessionid);
>                         C = distinct A.userid;
>                         generate group as testid,
>                                  COUNT(TESTDATA_FILTERED) as counttestdata,
>                                  COUNT(C) as distcount,
>                                  SUM(TESTDATA_FILTERED.flag) as total_flags;
>                 }
> TESTDATA_AGG_1 = group TESTDATA_AGG ALL;
> -- count records generated through nested foreach which contains distinct
> TESTDATA_AGG_2 = foreach TESTDATA_AGG_1 generate COUNT(TESTDATA_AGG);
> --explain TESTDATA_AGG_2;
> dump TESTDATA_AGG_2;
> --RESULT (1L)
> {code}
> But when I count the records without the filter and distinct in the 
> foreach, I get a different value (20L):
> {code}
> TESTDATA = load 'testdata' using PigStorage() as (timestamp:chararray,
>     testid:chararray, userid:chararray, sessionid:chararray, value:long,
>     flag:int);
> TESTDATA_FILTERED = filter TESTDATA by (timestamp gte '1230800400000' and
>     timestamp lt '1230804000000' and value != 0);
> TESTDATA_GROUP = group TESTDATA_FILTERED by testid;
> -- count records generated through simple foreach
> TESTDATA_AGG2 = foreach TESTDATA_GROUP generate group as testid,
>     COUNT(TESTDATA_FILTERED) as counttestid,
>     SUM(TESTDATA_FILTERED.flag) as total_flags;
> TESTDATA_AGG2_1 = group TESTDATA_AGG2 ALL;
> TESTDATA_AGG2_2 = foreach TESTDATA_AGG2_1 generate COUNT(TESTDATA_AGG2);
> dump TESTDATA_AGG2_2;
> --RESULT (20L)
> {code}
> The test data is attached.
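
The expectation behind this report can be sketched outside Pig. The Python below is only a model of the intended semantics, not of Pig's implementation: it assumes a foreach over a grouped relation should emit one row per group, even when the nested filter leaves an empty bag for that group. The sample rows and the helper names (ROWS, group_by_testid, nested_agg, simple_agg) are hypothetical and are not taken from the attached testdata.

```python
from collections import defaultdict

# Hypothetical rows: (testid, userid, sessionid, flag).
# These are made up for illustration, not read from the attached testdata.
ROWS = [
    ("t1", "u1", "u1", 1),
    ("t1", "u2", "s9", 0),
    ("t2", "u3", "s3", 1),  # no row in t2 has userid == sessionid
    ("t3", "u4", "u4", 1),
    ("t3", "u4", "u4", 0),
]


def group_by_testid(rows):
    """Model of: group TESTDATA_FILTERED by testid."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row)
    return groups


def nested_agg(rows):
    """Model of the nested foreach: one output row per group."""
    out = []
    for testid, bag in sorted(group_by_testid(rows).items()):
        a = [r for r in bag if r[1] == r[2]]  # A = filter ... by (userid eq sessionid)
        c = {r[1] for r in a}                 # C = distinct A.userid
        out.append((testid, len(bag), len(c), sum(r[3] for r in bag)))
    return out


def simple_agg(rows):
    """Model of the simple foreach without filter and distinct."""
    out = []
    for testid, bag in sorted(group_by_testid(rows).items()):
        out.append((testid, len(bag), sum(r[3] for r in bag)))
    return out


nested = nested_agg(ROWS)
simple = simple_agg(ROWS)
# Under the assumed semantics both pipelines yield one row per testid.
print(len(nested), len(simple))
```

In this model the group whose nested filter matches nothing still produces a row with a distinct count of 0, so both pipelines report the same number of rows. The script in the report instead returns 1 versus 20.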

