Re: COUNT(A.field1)

Mridul Muralidharan Thu, 26 Aug 2010 01:28:17 -0700


But it does for COUNT(A.a2)?
That is interesting, and somewhat weird :)


Thanks !
Mridul

On Thursday 26 August 2010 09:05 AM, Dmitriy Ryaboy wrote:

I think if you do COUNT(A), Pig will not realize it can ignore a2 and
a3, and will project all of them.
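
One way to avoid relying on the optimizer here is to project the needed column explicitly before grouping. This is a minimal sketch that reuses the relation names from the original question in this thread; the alias A1 is hypothetical and introduced only for illustration, and whether it helps in practice depends on the loader and the Pig version.

```pig
A  = LOAD 'test.csv' AS (a1, a2, a3);
-- Keep only the grouping key, so a2 and a3 are dropped before the GROUP.
A1 = FOREACH A GENERATE a1;
B  = GROUP A1 BY a1;
-- Each bag now holds single-field tuples, so COUNT sees only a1.
C  = FOREACH B GENERATE group, COUNT(A1);
```

The explicit FOREACH states the projection in the script itself, so the result does not depend on Pig inferring that a2 and a3 are unused.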

On Wed, Aug 25, 2010 at 4:31 PM, Mridul Muralidharan
<mrid...@yahoo-inc.com> wrote:


    I am not sure why the second option is better: in both cases, you are
    shipping only the combined counts from map to reduce.
    On the other hand, the first could be better, since it means we need to
    project only 'a1' and none of the other fields.

    Or did I miss something here?
    I am not very familiar with what Pig does in this case right now.

    Regards,
    Mridul


    On Thursday 26 August 2010 03:45 AM, Dmitriy Ryaboy wrote:

        Generally speaking, the second option will be more performant as it
        might let you drop column a3 early. In most cases the magnitude of
        this is likely to be very small as COUNT is an algebraic function, so
        most of the work is done map-side anyway, and only partial,
        pre-aggregated counts are shipped from mappers to reducers. However,
        if A is very wide, or a column store, or has non-negligible
        deserialization cost that can be offset by only deserializing a few
        fields -- the second option is better.

        -D

        On Wed, Aug 25, 2010 at 1:58 PM, Corbin Hoenes <cor...@tynt.com>
        wrote:

            Wondering about performance and COUNT...
            A = LOAD 'test.csv' AS (a1, a2, a3);
            B = GROUP A BY a1;
            -- Which is preferred?
            C = FOREACH B GENERATE COUNT(A);
            -- Or would this send only a single field through COUNT and
            -- be more performant?
            C = FOREACH B GENERATE COUNT(A.a2);
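
To check what Pig actually does with each variant, rather than guessing, the two plans can be compared with EXPLAIN. This is a sketch using the aliases from the question above; C1 and C2 are hypothetical names chosen so both variants can live in one script, and the exact plan output varies by Pig version.

```pig
A  = LOAD 'test.csv' AS (a1, a2, a3);
B  = GROUP A BY a1;
-- Variant 1: count whole tuples in each bag.
C1 = FOREACH B GENERATE COUNT(A);
-- Variant 2: count a single projected field.
C2 = FOREACH B GENERATE COUNT(A.a2);
-- Print the logical, physical, and MapReduce plans for each variant.
EXPLAIN C1;
EXPLAIN C2;
```

When reading the output, the things to look for are which columns are still carried in the map-side plan and whether a combiner stage is present for the COUNT.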
