I think if you do COUNT(A), Pig will not realize that it can ignore a2 and a3,
and will project all of them.
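
One way to avoid relying on the optimizer is to project the grouping key
yourself before the GROUP. A rough, untested sketch, reusing the schema from
the original question; the alias A1 and the PigStorage(',') delimiter are my
assumptions, not something stated in this thread:

```pig
-- Assumption: test.csv is comma-delimited (PigStorage defaults to tab).
A  = LOAD 'test.csv' USING PigStorage(',') AS (a1, a2, a3);
-- Keep only the grouping key, so a2 and a3 are dropped before the GROUP.
A1 = FOREACH A GENERATE a1;
B  = GROUP A1 BY a1;
-- COUNT skips tuples whose first field is null; here that field is a1,
-- the same field COUNT(A) would look at in the original script.
C  = FOREACH B GENERATE group, COUNT(A1);
```

This should at least keep a2 and a3 out of everything after the projection.
Note that COUNT(A.a2) is not quite equivalent: as far as I know, COUNT ignores
tuples whose first field is null, so rows with a null a2 would not be counted.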

On Wed, Aug 25, 2010 at 4:31 PM, Mridul Muralidharan
<mrid...@yahoo-inc.com> wrote:

>
> I am not sure why the second option is better - in both cases, you are
> shipping only the combined counts from map to reduce.
> On the other hand, the first could be better, since it means we need to
> project only 'a1' - and none of the other fields.
>
> Or did I miss something here?
> I am not very familiar with what Pig does in this case right now.
>
> Regards,
> Mridul
>
>
> On Thursday 26 August 2010 03:45 AM, Dmitriy Ryaboy wrote:
>
>> Generally speaking, the second option will be more performant, as it might
>> let you drop column a3 early. In most cases the difference is likely to be
>> very small: COUNT is an algebraic function, so most of the work is done
>> map-side anyway, and only partial, pre-aggregated counts are shipped from
>> mappers to reducers. However, if A is very wide, is stored in a column
>> store, or has a non-negligible deserialization cost that can be offset by
>> deserializing only a few fields, the second option is better.
>>
>> -D
>>
>> On Wed, Aug 25, 2010 at 1:58 PM, Corbin Hoenes <cor...@tynt.com> wrote:
>>
>>> Wondering about performance and count...
>>> A = load 'test.csv' as (a1,a2,a3);
>>> B = GROUP A by a1;
>>> -- which is preferred?
>>> C = FOREACH B GENERATE COUNT(A);
>>> -- or would this only send a single field through the COUNT and be more
>>> -- performant?
>>> C = FOREACH B GENERATE COUNT(A.a2);
>>>
>>>
>>>
>>>
>
