Alan Gates commented on PIG-988:

Consider a script like:

A = load 'bla';
B = group A by $0;
C = foreach B {
       D = A.$1;
       E = distinct D;
       generate group, COUNT(E);
};

This is count distinct, a fairly common thing to do.  Currently Pig uses 
the combiner to remove as many duplicate values from D as possible, but a 
final distinct pass is still required on the reducer, and DistinctBag is 
used for it.  In this particular case, it would be possible to instead use 
Hadoop's secondary sort to sort the incoming records on the full tuple, and 
then use a different implementation of DistinctBag that expects the incoming 
records to be sorted and removes duplicates as they arrive.
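
To illustrate why sorted input makes the final distinct pass cheap, here is a 
minimal sketch in Python rather than Pig's own Java.  The function names 
distinct_sorted and count_distinct_sorted are hypothetical and are not part of 
Pig or Hadoop.  The sketch assumes the records for one group have already been 
sorted on the full tuple, as secondary sort would arrange.

```python
from typing import Iterable, Iterator, Tuple


def distinct_sorted(records: Iterable[Tuple]) -> Iterator[Tuple]:
    """Yield each distinct tuple once, assuming `records` arrives sorted.

    Hypothetical helper for illustration only; not Pig's DistinctBag.
    """
    first = True
    previous = None
    for record in records:
        # In sorted input all duplicates are adjacent, so comparing with the
        # previous record is enough; no set or in-memory sort is needed.
        if first or record != previous:
            yield record
            previous = record
            first = False


def count_distinct_sorted(records: Iterable[Tuple]) -> int:
    """Count distinct tuples in a sorted stream using constant extra memory."""
    return sum(1 for _ in distinct_sorted(records))
```

Because only the previous record is held, memory use does not grow with the 
number of distinct values in the group, which is the appeal of relying on the 
sort instead of deduplicating inside the bag.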

Note that this could not be used in conjunction with the order by optimization 
proposed in PIG-980.

> Better implementation of distinct aggs
> --------------------------------------
>                 Key: PIG-988
>                 URL: https://issues.apache.org/jira/browse/PIG-988
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Alan Gates
> Distinct aggregates by definition cannot use the combiner (though the 
> distinct can be and is done in the combiner).  Since this is a common use 
> case it would be good to optimize.
