Pradeep Kamath commented on PIG-514:

A proposal for fixing this is the following:

Relational operators (like filter) inside a foreach which produce no result 
tuples will provide an empty bag as output. If this empty bag is input to a UDF, 
the UDF will receive an empty bag as its input argument.

To achieve the above semantics a change will be needed in POProject (this 
change could be factored into a new subclass of POProject if that is cleaner). 
Currently when the leaf ExpressionOperator of an inner plan in POForeach has a 
relational operator as its input, a POProject is introduced in between which 
takes tuples from the relational operator into a bag and provides the bag as 
input to the leaf ExpressionOperator. 
For example a COUNT() with filter as input in a foreach would look like this in 
the MR plan:
New For Each(false)[bag] - pradeepk-Mon Mar 30 20:21:50 PDT 2009-53
    |   |
    |   POUserFunc(org.apache.pig.builtin.COUNT)[long] - pradeepk-Mon Mar 30 20:21:50 PDT 2009-52
    |   |
    |   |---Project[bag][*] - pradeepk-Mon Mar 30 20:21:50 PDT 2009-51
    |       |
    |       |---Filter[bag] - pradeepk-Mon Mar 30 20:21:50 PDT 2009-46

This POProject will need to maintain internal state to check whether it 
receives no other input and only receives an EOP from its predecessor relational 
operator (say a POFilter). In such a case, it will need to send an empty bag as 
its output.
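
The state handling described above could be sketched roughly as follows. This 
is an illustration in Python, not Pig's Java code: Status, FilterOp and 
BagProject are hypothetical stand-ins for the result status, POFilter and 
POProject, and the real operator interfaces differ.

```python
from enum import Enum, auto


class Status(Enum):
    OK = auto()   # a value is available
    EOP = auto()  # end of processing for the current input


class FilterOp:
    """Hypothetical stand-in for a relational operator such as POFilter."""

    def __init__(self, tuples, predicate):
        self._matches = iter([t for t in tuples if predicate(t)])

    def get_next(self):
        # Return one matching tuple per call, then EOP once exhausted.
        try:
            return Status.OK, next(self._matches)
        except StopIteration:
            return Status.EOP, None


class BagProject:
    """Hypothetical stand-in for the POProject that feeds a bag to a UDF."""

    def __init__(self, source):
        self._source = source
        self._emitted = False  # True once a bag was sent for this input

    def get_next_bag(self):
        if self._emitted:
            # The bag for this input was already sent; now signal EOP.
            self._emitted = False
            return Status.EOP, None
        bag = []
        while True:
            status, t = self._source.get_next()
            if status is Status.EOP:
                break
            bag.append(t)
        # Even if only an EOP arrived, send a (possibly empty) bag downstream
        # so that the UDF is still invoked.
        self._emitted = True
        return Status.OK, bag
```

With a filter that rejects every tuple, the first call yields an empty bag and 
only the second call yields EOP, so a downstream UDF would still be called once.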

Expected results for aggregates in such a case (say where a filter in the 
foreach filters away all records) will be:
SUM, MIN, MAX, AVG - null
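
As a rough illustration of these expected results, the sketch below models a 
bag as a Python list and Pig's null as None. The function names are 
hypothetical and are not the Pig builtins; only the four aggregates listed 
above are covered.

```python
def agg_sum(bag):
    # Empty bag -> null (None), otherwise the total.
    return sum(bag) if bag else None


def agg_min(bag):
    return min(bag) if bag else None


def agg_max(bag):
    return max(bag) if bag else None


def agg_avg(bag):
    return sum(bag) / len(bag) if bag else None
```

For example, agg_sum([]) gives None, while agg_sum([45.0]) gives 45.0.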


> COUNT returns no results as a result of two filter statements in FOREACH
> ------------------------------------------------------------------------
>                 Key: PIG-514
>                 URL: https://issues.apache.org/jira/browse/PIG-514
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.2.0
>            Reporter: Viraj Bhat
>         Attachments: mystudentfile.txt
> The following sample code, in which a FOREACH counts the filtered student 
> records based on record_type == 1 and scores, and also on record_type == 0, 
> does not seem to return any results.
> {code}
> mydata = LOAD 'mystudentfile.txt' AS  (record_type,name,age,scores,gpa);
> --keep only what we need
> mydata_filtered = FOREACH mydata GENERATE record_type, name, age, scores;
> --group
> mydata_grouped = GROUP mydata_filtered BY  (record_type,age);
> myfinaldata = FOREACH mydata_grouped {
>      myfilter1 = FILTER mydata_filtered BY record_type == 1 AND age == scores;
>      myfilter2 = FILTER mydata_filtered BY record_type == 0;
>      GENERATE FLATTEN(group),
> -- Only this count causes the problem ??
>       COUNT(myfilter1) as col2,
>       SUM(myfilter2.scores) as col3,
>       COUNT(myfilter2) as col4;  };
> --this set of statements confirms that the count on the filter returns 1
> --mycountdata = FOREACH mydata_grouped
> --{
> --      myfilter1 = FILTER mydata_filtered BY record_type == 1 AND age == scores;
> --      GENERATE
> --      COUNT(myfilter1) as colcount;
> --};
> --dump mycountdata;
> dump myfinaldata;
> {code}
> But if you comment out the {code} COUNT(myfilter1) as col2, {code} line, it 
> seems to work, with the following results:
> (0,22,45.0,2L)
> (0,24,133.0,6L)
> (0,25,22.0,1L)
> Also, I have tried to verify whether this is an issue with the {code} 
> COUNT(myfilter1) as col2, {code} returning zero. It does not seem to be the 
> case.
> If {code}  dump mycountdata; {code} is uncommented it returns:
> (1L)
> (1L)
> I am attaching the tab-separated 'mystudentfile.txt' file used in this Pig 
> script. Is this an issue with 2 filters in the FOREACH followed by a COUNT on 
> these filters?

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
