Filter in foreach seems to drop records resulting in decreased count of records
-------------------------------------------------------------------------------
Key: PIG-739
URL: https://issues.apache.org/jira/browse/PIG-739
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.3.0
Reporter: Viraj Bhat
Fix For: 0.3.0
I have a Pig script that, inside a foreach, filters each group and counts the
distinct userid values that pass the filter. Counting the records produced by
that foreach (alias TESTDATA_AGG_2) gives 1.
{code}
TESTDATA = load 'testdata' using PigStorage() as (timestamp:chararray,
testid:chararray, userid: chararray, sessionid:chararray, value:long, flag:int);
TESTDATA_FILTERED = filter TESTDATA by (timestamp gte '1230800400000' and
timestamp lt '1230804000000' and value != 0);
TESTDATA_GROUP = group TESTDATA_FILTERED by testid;
TESTDATA_AGG = foreach TESTDATA_GROUP {
    A = filter TESTDATA_FILTERED by (userid eq sessionid);
    C = distinct A.userid;
    generate group as testid, COUNT(TESTDATA_FILTERED) as counttestdata,
        COUNT(C) as distcount, SUM(TESTDATA_FILTERED.flag) as total_flags;
}
TESTDATA_AGG_1 = group TESTDATA_AGG ALL;
-- count records generated through nested foreach which contains distinct
TESTDATA_AGG_2 = foreach TESTDATA_AGG_1 generate COUNT(TESTDATA_AGG);
--explain TESTDATA_AGG_2;
dump TESTDATA_AGG_2;
--RESULT (1L)
{code}
But when I count the records without the filter and distinct in the foreach,
I get a different value (20L):
{code}
TESTDATA = load 'testdata' using PigStorage() as (timestamp:chararray,
testid:chararray, userid: chararray, sessionid:chararray, value:long, flag:int);
TESTDATA_FILTERED = filter TESTDATA by (timestamp gte '1230800400000' and
timestamp lt '1230804000000' and value != 0);
TESTDATA_GROUP = group TESTDATA_FILTERED by testid;
-- count records generated through simple foreach
TESTDATA_AGG2 = foreach TESTDATA_GROUP generate group as testid,
COUNT(TESTDATA_FILTERED) as counttestid, SUM(TESTDATA_FILTERED.flag) as
total_flags;
TESTDATA_AGG2_1 = group TESTDATA_AGG2 ALL;
TESTDATA_AGG2_2 = foreach TESTDATA_AGG2_1 generate COUNT(TESTDATA_AGG2);
dump TESTDATA_AGG2_2;
--RESULT (20L)
{code}
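To make the expectation explicit: a foreach over a grouped relation should emit
one record per group, so a nested filter and distinct should not change the
record count. The following is a minimal Python sketch of that expected
behaviour only. It is not Pig code and does not reflect how Pig implements
these operators. The sample rows and the helper names (group_by,
nested_foreach, simple_foreach) are hypothetical and are not taken from the
attached testdata.

```python
from collections import defaultdict


def group_by(rows, key):
    """Model of 'group ... by key': map each key value to its bag of rows."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return groups


def nested_foreach(groups):
    """Model of the nested foreach: exactly one output record per group."""
    out = []
    for testid, bag in groups.items():
        a = [r for r in bag if r["userid"] == r["sessionid"]]  # A = filter ...
        c = {r["userid"] for r in a}                           # C = distinct A.userid
        out.append((testid, len(bag), len(c), sum(r["flag"] for r in bag)))
    return out


def simple_foreach(groups):
    """Model of the simple foreach: also one output record per group."""
    return [(testid, len(bag), sum(r["flag"] for r in bag))
            for testid, bag in groups.items()]


# Hypothetical sample rows (assumption, for illustration only).
rows = [
    {"testid": "t1", "userid": "u1", "sessionid": "u1", "flag": 1},
    {"testid": "t1", "userid": "u2", "sessionid": "s9", "flag": 0},
    {"testid": "t2", "userid": "u3", "sessionid": "s3", "flag": 1},  # nothing passes the inner filter
    {"testid": "t3", "userid": "u4", "sessionid": "u4", "flag": 1},
]

groups = group_by(rows, "testid")
nested = nested_foreach(groups)
simple = simple_foreach(groups)

# Both counts should match the number of groups.
print(len(nested), len(simple))
```

In this model a group whose inner filter matches nothing (t2 above) still
yields a record with a distinct count of 0, so both scripts would be expected
to report the same count (20L for the attached data).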
Attaching testdata