I think that if you do COUNT(A), Pig will not realize it can ignore a2 and a3, and will project all three columns.
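If the aim is to make sure only one column is carried through, one option is to project explicitly before the GROUP rather than rely on Pig to prune columns on its own. A minimal sketch based on the script from the original question, untested here; the relation name A1 is made up for illustration:

```pig
-- Sketch only: A1 is a hypothetical relation name, not part of the original script.
A = LOAD 'test.csv' AS (a1, a2, a3);
-- Keep just the grouping key so a2 and a3 are dropped before the GROUP.
A1 = FOREACH A GENERATE a1;
B = GROUP A1 BY a1;
-- Emit the key alongside the per-group count.
C = FOREACH B GENERATE group, COUNT(A1);
```

Whether this gains anything measurable depends on how wide A is and how costly its deserialization is, as discussed in the thread below.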
On Wed, Aug 25, 2010 at 4:31 PM, Mridul Muralidharan <mrid...@yahoo-inc.com> wrote:

>
> I am not sure why the second option is better - in both cases, you are
> shipping only the combined counts from map to reduce.
> On the other hand, the first could be better, since it means we need to
> project only 'a1' - and none of the other fields.
>
> Or did I miss something here?
> I am not very familiar with what Pig does in this case right now.
>
> Regards,
> Mridul
>
>
> On Thursday 26 August 2010 03:45 AM, Dmitriy Ryaboy wrote:
>
>> Generally speaking, the second option will be more performant as it might
>> let you drop column a3 early. In most cases the magnitude of this is likely
>> to be very small as COUNT is an algebraic function, so most of the work is
>> done map-side anyway, and only partial, pre-aggregated counts are shipped
>> from mappers to reducers. However, if A is very wide, or a column store, or
>> has non-negligible deserialization cost that can be offset by only
>> deserializing a few fields -- the second option is better.
>>
>> -D
>>
>> On Wed, Aug 25, 2010 at 1:58 PM, Corbin Hoenes <cor...@tynt.com> wrote:
>>
>>> Wondering about performance and count...
>>> A = load 'test.csv' as (a1,a2,a3);
>>> B = GROUP A by a1;
>>> -- which preferred?
>>> C = FOREACH B GENERATE COUNT(A);
>>> -- or would this only send a single field through the COUNT and be more
>>> performant?
>>> C = FOREACH B GENERATE COUNT(A.a2);
>>>
>