Pradeep Kamath updated PIG-514:

    Attachment: PIG-514.patch

Attached patch which implements the proposed design. The changes are spread 
across the following areas:
- Parser: in QueryParser.jjt, when a relational operator is followed by a 
Project(*), the Project is now marked as a special Project which sends an 
empty bag to the predecessor on EOP.
- LogToPhyTranslationVisitor: depending on whether the above flag is set, 
either a PORelationToExprProject or a regular POProject is created.
- PORelationToExprProject extends POProject and overrides only the 
getNext(DataBag) method. It sends an empty bag on first encountering an EOP 
and sets state so that an EOP is sent down the next time it is called. However, 
if the POForEach containing this project starts a new set of inputs, this 
flag is reset in the reset() method.
- PhysicalOperator now has a reset() method, used by 
PORelationToExprProject and by the limit/sort/distinct operators when a limit 
is present, to reset state when new input for the foreach starts.
- The builtins (SUM/COUNT/MIN/MAX/AVG) now handle empty bags: COUNT gives 0 
and the others give null as output. This change includes the type-specific 
implementations of these aggregates, such as IntSum and LongSum.
- Test cases to test the empty bag case
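
The empty-bag handling described in the list above can be sketched roughly as 
follows. This is a simplified model, not the code in the patch: the class 
EmptyBagSketch, its Result type, and the count/sum helpers are hypothetical 
names made up for illustration, and the real operators use Pig's own result 
and status types. Clearing the flag when real data passes through is also an 
assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative model only: these are NOT Pig's classes or signatures.
final class EmptyBagSketch {

    /** One pipeline result: either a bag of values or an end-of-processing (EOP) signal. */
    static final class Result {
        final boolean eop;
        final List<Double> bag; // null when eop is true

        private Result(boolean eop, List<Double> bag) {
            this.eop = eop;
            this.bag = bag;
        }

        static Result eop() { return new Result(true, null); }
        static Result of(List<Double> bag) { return new Result(false, bag); }
    }

    /** Models a project that turns the first EOP it sees into an empty bag. */
    static final class RelationToExprProject {
        private boolean sendEmptyBagOnEop = true;

        Result next(Result fromInput) {
            if (!fromInput.eop) {
                // Assumption: once real data has been passed on, no empty bag is owed.
                sendEmptyBagOnEop = false;
                return fromInput;
            }
            if (sendEmptyBagOnEop) {
                // First EOP: the consumer expects a bag, so hand it an empty one
                // and remember to pass the EOP through on the next call.
                sendEmptyBagOnEop = false;
                return Result.of(new ArrayList<Double>());
            }
            return fromInput; // later EOPs are passed through unchanged
        }

        /** Called when the enclosing foreach starts a new set of inputs. */
        void reset() { sendEmptyBagOnEop = true; }
    }

    /** COUNT-like aggregate: an empty bag counts as 0. */
    static long count(List<Double> bag) { return bag.size(); }

    /** SUM-like aggregate: an empty bag yields null rather than 0. */
    static Double sum(List<Double> bag) {
        if (bag.isEmpty()) {
            return null;
        }
        double total = 0.0;
        for (double v : bag) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        RelationToExprProject project = new RelationToExprProject();
        Result first = project.next(Result.eop());
        System.out.println("count=" + count(first.bag) + ", sum=" + sum(first.bag));
        System.out.println("second call is EOP: " + project.next(Result.eop()).eop);
        project.reset();
        System.out.println("after reset is EOP: " + project.next(Result.eop()).eop);
    }
}
```

In this model a filter that matches nothing still produces one output row for 
the group, with a count of 0 and a null sum, instead of dropping the row.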

> COUNT returns no results as a result of two filter statements in FOREACH
> ------------------------------------------------------------------------
>                 Key: PIG-514
>                 URL: https://issues.apache.org/jira/browse/PIG-514
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.2.0
>            Reporter: Viraj Bhat
>            Assignee: Pradeep Kamath
>         Attachments: mystudentfile.txt, PIG-514.patch
> The following sample code uses a FOREACH to count the student records 
> filtered on record_type == 1 and age == scores, and also those filtered on 
> record_type == 0, but it does not seem to return any results.
> {code}
> mydata = LOAD 'mystudentfile.txt' AS  (record_type,name,age,scores,gpa);
> --keep only what we need
> mydata_filtered = FOREACH  mydata GENERATE   record_type,  name,  age,  
> scores ;
> --group
> mydata_grouped = GROUP mydata_filtered BY  (record_type,age);
> myfinaldata = FOREACH mydata_grouped {
>      myfilter1 = FILTER mydata_filtered BY record_type == 1 AND age == scores;
>      myfilter2 = FILTER mydata_filtered BY record_type == 0;
>      GENERATE FLATTEN(group),
> -- Only this count causes the problem ??
>       COUNT(myfilter1) as col2,
>       SUM(myfilter2.scores) as col3,
>       COUNT(myfilter2) as col4;  };
> --this set of statements confirms that the count on the filter returns 1
> --mycountdata = FOREACH mydata_grouped
> --{
> --      myfilter1 = FILTER mydata_filtered BY record_type == 1 AND age == scores;
> --      GENERATE
> --      COUNT(myfilter1) as colcount;
> --};
> --dump mycountdata;
> dump myfinaldata;
> {code}
> But if you comment out the {code} COUNT(myfilter1) as col2, {code} line, it 
> seems to work, with the following results:
> (0,22,45.0,2L)
> (0,24,133.0,6L)
> (0,25,22.0,1L)
> I have also tried to verify whether this is an issue with {code} 
> COUNT(myfilter1) as col2, {code} returning zero. It does not seem to be the 
> case.
> If {code}  dump mycountdata; {code} is uncommented it returns:
> (1L)
> (1L)
> I am attaching the tab-separated 'mystudentfile.txt' file used in this Pig 
> script. Is this an issue with 2 filters in the FOREACH followed by a COUNT on 
> these filters?

