Pradeep Kamath commented on PIG-1014:

The issue I see is with the current implementation of COUNT. It looks only at
the first field of each tuple in the bag and counts only the tuples where that
field is non-null. This can lead to surprising results. Consider a relation (A)
with two fields and the following contents:
1 2
3 4
null 6
7 null
null null

If we have the following snippet:
B = group A all;
C = foreach B generate COUNT(A);

The answer is 3, which is arrived at by considering only record 1, record 2 and
record 4, since the other records have null in the first position. Ironically,
although record 4 has null in the second position, that does not prevent it
from being counted. So basing the result on the null-ness of just the first
field seems somewhat arbitrary. My concern is that most users would not know
that the result was arrived at *after* dropping records which had null in the
first field, even though they did not specify COUNT(A.$0). The status quo means
we equate COUNT(A) to COUNT(A.$0), which is also not apparent to users.
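
To make the numbers concrete, here is a small Python model of the behaviour
described above. It is only a sketch of the semantics as stated in this comment,
not Pig's actual implementation: None stands in for a Pig null, and the helper
names (count_first_field, count_star, project) are made up for illustration.

```python
# Toy model of the counting semantics discussed in this comment.
# Not Pig code: None plays the role of null, and all helper names
# below are hypothetical.

A = [
    (1, 2),
    (3, 4),
    (None, 6),
    (7, None),
    (None, None),
]


def count_first_field(bag):
    """Count tuples whose first field is non-null (COUNT as described above)."""
    return sum(1 for t in bag if t[0] is not None)


def count_star(bag):
    """Count every tuple, ignoring nulls entirely (the proposed behaviour)."""
    return len(bag)


def project(bag, i):
    """Model a projection such as A.$i: keep only field i of each tuple."""
    return [(t[i],) for t in bag]


count_a = count_first_field(A)               # models COUNT(A)
count_a0 = count_first_field(project(A, 0))  # models COUNT(A.$0)
count_a1 = count_first_field(project(A, 1))  # models COUNT(A.$1)
count_all = count_star(A)                    # models COUNT_STAR(A)

print(count_a, count_a0, count_a1, count_all)
```

In this model COUNT(A) and COUNT(A.$0) agree at 3, while counting every record
gives 5, which is the number most users would probably expect from COUNT(A).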

> Pig should convert COUNT(relation) to COUNT_STAR(relation) so that all 
> records are counted without considering nullness of the fields in the records
> ----------------------------------------------------------------------------------------------------------------------------------------------------
>                 Key: PIG-1014
>                 URL: https://issues.apache.org/jira/browse/PIG-1014
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.4.0
>            Reporter: Pradeep Kamath

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
