[ https://issues.apache.org/jira/browse/PIG-739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Viraj Bhat updated PIG-739:
---------------------------
Attachment: filter_distinctbug.pig
            testdata

Test data and Pig script attached.
> Filter in foreach seems to drop records resulting in decreased count of
> records
> -------------------------------------------------------------------------------
>
> Key: PIG-739
> URL: https://issues.apache.org/jira/browse/PIG-739
> Project: Pig
> Issue Type: Bug
> Components: impl
> Affects Versions: 0.3.0
> Reporter: Viraj Bhat
> Fix For: 0.3.0
>
> Attachments: filter_distinctbug.pig, testdata
>
>
> I have a Pig script that counts the number of distinct records resulting
> from a filter; the filter and the distinct are nested inside a foreach. The
> number of records I get with alias TESTDATA_AGG_2 is 1.
> {code}
> TESTDATA = load 'testdata' using PigStorage() as (timestamp:chararray,
> testid:chararray, userid: chararray, sessionid:chararray, value:long,
> flag:int);
> TESTDATA_FILTERED = filter TESTDATA by (timestamp gte '1230800400000' and
> timestamp lt '1230804000000' and value != 0);
> TESTDATA_GROUP = group TESTDATA_FILTERED by testid;
> TESTDATA_AGG = foreach TESTDATA_GROUP {
>     A = filter TESTDATA_FILTERED by (userid eq sessionid);
>     C = distinct A.userid;
>     generate group as testid,
>              COUNT(TESTDATA_FILTERED) as counttestdata,
>              COUNT(C) as distcount,
>              SUM(TESTDATA_FILTERED.flag) as total_flags;
> }
> TESTDATA_AGG_1 = group TESTDATA_AGG ALL;
> -- count records generated through nested foreach which contains distinct
> TESTDATA_AGG_2 = foreach TESTDATA_AGG_1 generate COUNT(TESTDATA_AGG);
> --explain TESTDATA_AGG_2;
> dump TESTDATA_AGG_2;
> --RESULT (1L)
> {code}
> But when I count the records without the filter and distinct in the
> foreach, I get a different value (20L):
> {code}
> TESTDATA = load 'testdata' using PigStorage() as (timestamp:chararray,
> testid:chararray, userid: chararray, sessionid:chararray, value:long,
> flag:int);
> TESTDATA_FILTERED = filter TESTDATA by (timestamp gte '1230800400000' and
> timestamp lt '1230804000000' and value != 0);
> TESTDATA_GROUP = group TESTDATA_FILTERED by testid;
> -- count records generated through simple foreach
> TESTDATA_AGG2 = foreach TESTDATA_GROUP generate
>     group as testid,
>     COUNT(TESTDATA_FILTERED) as counttestid,
>     SUM(TESTDATA_FILTERED.flag) as total_flags;
> TESTDATA_AGG2_1 = group TESTDATA_AGG2 ALL;
> TESTDATA_AGG2_2 = foreach TESTDATA_AGG2_1 generate COUNT(TESTDATA_AGG2);
> dump TESTDATA_AGG2_2;
> --RESULT (20L)
> {code}
> The test data and the script are attached (testdata, filter_distinctbug.pig).
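
The expectation behind this report can be modelled outside Pig. The sketch below is plain Python, not Pig's implementation, and it makes no claim about where the bug lies. It assumes that a foreach over a grouped relation should emit one record per group, whether or not a nested filter and distinct are present, so both scripts should report the same count. The rows and the function names (group_by_testid, aggregate_simple, aggregate_nested) are hypothetical and are not taken from the attached testdata.

```python
from collections import defaultdict

# Hypothetical rows: (testid, userid, sessionid, flag).
# These are illustrative only, not the attached testdata.
ROWS = [
    ("t1", "u1", "u1", 1),
    ("t1", "u2", "s9", 0),
    ("t2", "u3", "s3", 1),  # no row in t2 has userid == sessionid
    ("t3", "u4", "u4", 1),
    ("t3", "u4", "u4", 0),
]


def group_by_testid(rows):
    """Model of: group TESTDATA_FILTERED by testid."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row)
    return groups


def aggregate_simple(rows):
    """Model of the simple foreach: one output record per group."""
    return [
        (testid, len(bag), sum(r[3] for r in bag))
        for testid, bag in group_by_testid(rows).items()
    ]


def aggregate_nested(rows):
    """Model of the nested foreach with filter and distinct."""
    out = []
    for testid, bag in group_by_testid(rows).items():
        a = [r for r in bag if r[1] == r[2]]  # A = filter ... by (userid eq sessionid)
        c = {r[1] for r in a}                 # C = distinct A.userid
        # A group is still emitted when the nested filter matches nothing;
        # its distinct count is simply 0.
        out.append((testid, len(bag), len(c), sum(r[3] for r in bag)))
    return out


simple = aggregate_simple(ROWS)
nested = aggregate_nested(ROWS)
print(len(simple), len(nested))
```

Under that assumption both lists have one entry per testid, including the group where the nested filter matches nothing. For the scripts in this report, that corresponds to both dumps returning the same value rather than 1L and 20L.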