Generally speaking, the second option will be more performant, as it might
let you drop column a3 early. In most cases the difference is likely
to be very small: COUNT is an algebraic function, so most of the work is
done map-side anyway, and only partial, pre-aggregated counts are shipped
from mappers to reducers. However, if A is very wide, or is stored in a
columnar format, or has a non-negligible deserialization cost that can be
reduced by deserializing only a few fields, the second option is better.
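
To illustrate what "algebraic" buys you here, below is a rough Python sketch (not Pig's actual implementation) of how a count can be split into a map-side partial step and a reduce-side merge. The function names and the two input splits are made up for the example, and it ignores Pig's null handling.

```python
from collections import Counter

def map_side_partial_counts(rows):
    """Combiner-style step: each mapper counts its own rows per key (a1)."""
    partial = Counter()
    for a1, _a2, _a3 in rows:
        # Only the grouping key matters for the count; a2 and a3 are unused.
        partial[a1] += 1
    return partial

def reduce_side_merge(partials):
    """Reducer step: add up the partial counts from every mapper."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return total

# Two hypothetical input splits of test.csv, already parsed into (a1, a2, a3).
split_1 = [("x", 1, "p"), ("y", 2, "q"), ("x", 3, "r")]
split_2 = [("x", 4, "s"), ("y", 5, "t")]

partials = [map_side_partial_counts(s) for s in (split_1, split_2)]
counts = reduce_side_merge(partials)
print(dict(counts))
```

Each mapper ships one small (key, count) pair per group rather than the rows themselves, which is why the width of the tuples inside the bag matters much less once the combiner kicks in.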

-D

On Wed, Aug 25, 2010 at 1:58 PM, Corbin Hoenes <cor...@tynt.com> wrote:

> Wondering about performance and count...
> A =  load 'test.csv' as (a1,a2,a3);
> B = GROUP A by a1;
> -- which preferred?
> C = FOREACH B GENERATE COUNT(A);
> -- or would this only send a single field through the COUNT and be more
> -- performant?
> C = FOREACH B GENERATE COUNT(A.a2);
>
>
>
