[ https://issues.apache.org/jira/browse/PIG-514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Pradeep Kamath updated PIG-514:
-------------------------------

    Attachment: PIG-514.patch

Attached is a patch that implements the proposed design. The changes are spread across the following areas:
- Parser: in QueryParser.jjt, when a relational operator is followed by a Project(*), the Project is now marked as a special Project that sends empty bags to the predecessor on EOP.
- In LogToPhyTranslationVisitor, either a PORelationToExprProject or a regular POProject is created, depending on whether the above flag is set.
- PORelationToExprProject extends POProject and overrides only the getNext(DataBag) method. On first encountering an EOP it sends an empty bag and sets state so that an EOP is sent down the next time it is called. However, if the POForEach containing this project starts a new set of inputs, this flag is reset in the reset() method.
- PhysicalOperator now has a reset() method, used by PORelationToExprProject and by the limit/sort/distinct operators when a limit is present, to reset state when new input for the foreach starts.
- The builtins (SUM/COUNT/MIN/MAX/AVG) now handle empty bags: COUNT gives 0 and the others give null as output. This change includes the type-specific implementations of these aggregates, such as IntSum, LongSum, etc.
- Test cases for the empty bag case.

> COUNT returns no results as a result of two filter statements in FOREACH
> ------------------------------------------------------------------------
>
>                 Key: PIG-514
>                 URL: https://issues.apache.org/jira/browse/PIG-514
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.2.0
>            Reporter: Viraj Bhat
>            Assignee: Pradeep Kamath
>         Attachments: mystudentfile.txt, PIG-514.patch
>
>
> The following piece of sample code, which uses FOREACH to count the filtered
> student records based on record_type == 1 and scores, and also on record_type
> == 0, does not seem to return any results.
> {code}
> mydata = LOAD 'mystudentfile.txt' AS (record_type,name,age,scores,gpa);
> --keep only what we need
> mydata_filtered = FOREACH mydata GENERATE record_type, name, age, scores;
> --group
> mydata_grouped = GROUP mydata_filtered BY (record_type,age);
> myfinaldata = FOREACH mydata_grouped {
>     myfilter1 = FILTER mydata_filtered BY record_type == 1 AND age == scores;
>     myfilter2 = FILTER mydata_filtered BY record_type == 0;
>     GENERATE FLATTEN(group),
>     -- Only this count causes the problem ??
>     COUNT(myfilter1) as col2,
>     SUM(myfilter2.scores) as col3,
>     COUNT(myfilter2) as col4; };
> --these set of statements confirm that the count on the filters returns 1
> --mycountdata = FOREACH mydata_grouped
> --{
> --    myfilter1 = FILTER mydata_filtered BY record_type == 1 AND age == scores;
> --    GENERATE
> --    COUNT(myfilter1) as colcount;
> --};
> --dump mycountdata;
> dump myfinaldata;
> {code}
> But if you comment out the {code} COUNT(myfilter1) as col2, {code} line, it seems
> to work, with the following results:
> (0,22,45.0,2L)
> (0,24,133.0,6L)
> (0,25,22.0,1L)
> Also I have tried to verify whether this is an issue with the {code}
> COUNT(myfilter1) as col2, {code} returning zero. It does not seem to be the
> case.
> If {code} dump mycountdata; {code} is uncommented it returns:
> (1L)
> (1L)
> I am attaching the tab-separated 'mystudentfile.txt' file used in this Pig
> script. Is this an issue with 2 filters in the FOREACH followed by a COUNT on
> these filters?
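
For readers following the patch description, the empty-bag handling it outlines can be sketched roughly as below. This is a simplified stand-in written for illustration, not code taken from PIG-514.patch: the class EmptyBagSketch, its Status and Result types, and the count/sum helpers are all hypothetical names, and bags are modelled as plain lists of longs rather than Pig DataBag objects. It only aims to show the state handling (empty bag on the first EOP, EOP on the next call, flag reset on new input) and the aggregate results for an empty bag (COUNT gives 0, SUM gives null).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

/** Simplified sketch; none of these types are the real Pig classes. */
final class EmptyBagSketch {

    /** Stand-in for Pig's result status codes. */
    enum Status { OK, EOP }

    /** A status plus an optional bag, modelled here as a list of longs. */
    static final class Result {
        final Status status;
        final List<Long> bag;

        Result(Status status, List<Long> bag) {
            this.status = status;
            this.bag = bag;
        }
    }

    /** Hypothetical projection modelled on the described PORelationToExprProject. */
    static final class RelationToExprProject {
        private Iterator<List<Long>> input = Collections.emptyIterator();
        private boolean sendEmptyBagOnEOP = true;

        /** Called when the enclosing foreach starts a new set of inputs. */
        void reset(Iterator<List<Long>> newInput) {
            input = newInput;
            sendEmptyBagOnEOP = true;
        }

        Result getNext() {
            if (input.hasNext()) {
                // A real bag is going downstream, so no empty bag is needed later.
                sendEmptyBagOnEOP = false;
                return new Result(Status.OK, input.next());
            }
            if (sendEmptyBagOnEOP) {
                // First EOP with nothing sent yet: emit an empty bag, then EOP next time.
                sendEmptyBagOnEOP = false;
                return new Result(Status.OK, new ArrayList<>());
            }
            return new Result(Status.EOP, null);
        }
    }

    /** COUNT-like aggregate: an empty bag counts as 0. */
    static long count(List<Long> bag) {
        return bag.size();
    }

    /** SUM-like aggregate: an empty bag yields null rather than 0. */
    static Long sum(List<Long> bag) {
        if (bag.isEmpty()) {
            return null;
        }
        long total = 0L;
        for (long value : bag) {
            total += value;
        }
        return total;
    }

    public static void main(String[] args) {
        RelationToExprProject project = new RelationToExprProject();

        // A nested FILTER that matched nothing: the input is exhausted immediately.
        project.reset(Collections.<List<Long>>emptyIterator());
        Result first = project.getNext();
        System.out.println(first.status + " count=" + count(first.bag) + " sum=" + sum(first.bag));
        System.out.println(project.getNext().status);

        // A new group with one matching bag resets the state.
        project.reset(Collections.singletonList(Arrays.asList(20L, 25L)).iterator());
        Result second = project.getNext();
        System.out.println(second.status + " count=" + count(second.bag) + " sum=" + sum(second.bag));
        System.out.println(project.getNext().status);
    }
}
```

In this sketch, the group whose filter matched nothing prints "OK count=0 sum=null" followed by "EOP", so the enclosing GENERATE still receives a value for every aggregate and can emit a row. The group with one matching bag prints "OK count=2 sum=45" followed by "EOP".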