[ https://issues.apache.org/jira/browse/PIG-514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12693981#action_12693981 ]
Pradeep Kamath commented on PIG-514:
------------------------------------

A proposal for fixing this is the following:

Semantics:
Relational operators (like filter) inside a foreach which produce no result tuples will provide an empty bag as output. If this empty bag is input to a UDF, the UDF will receive an empty bag as its input argument.

Implementation:
To achieve the above semantics, a change will be needed in POProject (this change could be factored into a new subclass of POProject if that is cleaner). Currently, when the leaf ExpressionOperator of an inner plan in POForeach has a relational operator as its input, a POProject is introduced in between, which collects the tuples from the relational operator into a bag and provides the bag as input to the leaf ExpressionOperator. For example, a COUNT() with a filter as input in a foreach would look like this in the MR plan:
{noformat}
New For Each(false)[bag] - pradeepk-Mon Mar 30 20:21:50 PDT 2009-53
|   |
|   POUserFunc(org.apache.pig.builtin.COUNT)[long] - pradeepk-Mon Mar 30 20:21:50 PDT 2009-52
|   |
|   |---Project[bag][*] - pradeepk-Mon Mar 30 20:21:50 PDT 2009-51
|       |
|       |---Filter[bag] - pradeepk-Mon Mar 30 20:21:50 PDT 2009-46
{noformat}
This POProject will need to maintain internal state to detect the case where it receives only an EOP from its predecessor relational operator (say a POFilter) and no other input. In such a case, it will need to send an empty bag as its output.

Expected results for aggregates in such a case (say, where a filter in the foreach filters away all records) will be:
COUNT - 0
SUM, MIN, MAX, AVG - null

Comments/Thoughts?

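The proposed semantics can be illustrated with a small toy model. This is only a sketch in Python, not Pig's Java implementation: `EOP`, `filter_op`, `project_to_bag`, `count` and `sum_field` are hypothetical names invented for the illustration, and the row layout (record_type, age, scores) is an assumption. It shows a projection that still emits a bag, an empty one, when the upstream filter sends nothing but EOP, and the aggregate results listed above for that case.

```python
from typing import Callable, Iterable, Iterator, List, Optional, Tuple

# Hypothetical stand-in for an end-of-processing marker; not Pig's real API.
EOP = object()


def filter_op(rows: Iterable[Tuple], predicate: Callable[[Tuple], bool]) -> Iterator:
    """Toy relational filter: yields matching tuples, then a single EOP."""
    for row in rows:
        if predicate(row):
            yield row
    yield EOP


def project_to_bag(upstream: Iterator) -> List[Tuple]:
    """Toy projection: drain the upstream operator into a bag.

    If EOP is the only thing received, the result is an empty bag
    rather than no output at all (the proposed semantics).
    """
    bag: List[Tuple] = []
    for item in upstream:
        if item is EOP:
            break
        bag.append(item)
    return bag


def count(bag: List[Tuple]) -> int:
    """COUNT of an empty bag is 0."""
    return len(bag)


def sum_field(bag: List[Tuple], index: int) -> Optional[float]:
    """SUM over one field; an empty bag gives None (standing in for null)."""
    if not bag:
        return None
    return sum(row[index] for row in bag)


# Assumed layout: (record_type, age, scores). No row has record_type == 1,
# so the filter removes every record and the UDFs see an empty bag.
rows = [(0, 22, 45.0), (0, 24, 133.0)]
empty = project_to_bag(filter_op(rows, lambda r: r[0] == 1))
print(count(empty), sum_field(empty, 2))
```

With these semantics the foreach still produces a row for the group, carrying 0 for the COUNT and null for the SUM, instead of producing no row at all.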
> COUNT returns no results as a result of two filter statements in FOREACH
> ------------------------------------------------------------------------
>
>                 Key: PIG-514
>                 URL: https://issues.apache.org/jira/browse/PIG-514
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.2.0
>            Reporter: Viraj Bhat
>         Attachments: mystudentfile.txt
>
>
> The following sample code uses a FOREACH to count the filtered student records, once based on record_type == 1 and scores and once based on record_type == 0. It does not seem to return any results.
> {code}
> mydata = LOAD 'mystudentfile.txt' AS (record_type,name,age,scores,gpa);
> --keep only what we need
> mydata_filtered = FOREACH mydata GENERATE record_type, name, age, scores;
> --group
> mydata_grouped = GROUP mydata_filtered BY (record_type,age);
> myfinaldata = FOREACH mydata_grouped {
>     myfilter1 = FILTER mydata_filtered BY record_type == 1 AND age == scores;
>     myfilter2 = FILTER mydata_filtered BY record_type == 0;
>     GENERATE FLATTEN(group),
>         -- Only this count causes the problem ??
>         COUNT(myfilter1) as col2,
>         SUM(myfilter2.scores) as col3,
>         COUNT(myfilter2) as col4;
> };
> --these set of statements confirm that the count on the filters returns 1
> --mycountdata = FOREACH mydata_grouped
> --{
> --    myfilter1 = FILTER mydata_filtered BY record_type == 1 AND age == scores;
> --    GENERATE
> --        COUNT(myfilter1) as colcount;
> --};
> --dump mycountdata;
> dump myfinaldata;
> {code}
> But if you comment out the {code} COUNT(myfilter1) as col2, {code} line, it seems to work, with the following results:
> (0,22,45.0,2L)
> (0,24,133.0,6L)
> (0,25,22.0,1L)
> I have also tried to verify whether this is an issue with {code} COUNT(myfilter1) as col2, {code} returning zero. That does not seem to be the case.
> If {code} dump mycountdata; {code} is uncommented, it returns:
> (1L)
> (1L)
> I am attaching the tab-separated 'mystudentfile.txt' file used in this Pig script.
> Is this an issue with 2 filters in the FOREACH followed by a COUNT on these filters?