pig-user  

RE: how to get the size of a data bag

Olga Natkovich
Thu, 07 Aug 2008 15:02:16 -0700

The way your query is formulated, the combiner is not called, which
would account for the slowness.

Try this:

A = LOAD <my file> as (type, ...);
B = GROUP  A  BY  type;
C = foreach B generate group, COUNT(A);

You can check whether the combiner will be called by running

Explain C;
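
As a minimal sketch, the full script might look like the following. It
assumes a tab-delimited input at the hypothetical path 'records.txt'
whose first field is the type; the path and the second field name are
placeholders, not taken from your data:

A = LOAD 'records.txt' AS (type, value);  -- hypothetical path and schema
B = GROUP A BY type;                      -- one bag of records per type
C = FOREACH B GENERATE group, COUNT(A);   -- (type, number of records)
EXPLAIN C;                                -- print the plan without running the job
DUMP C;                                   -- run the job and print the counts

COUNT is an algebraic function, so when the combiner is used, partial
counts are computed on the map side and the reducers only add them up.
If the plan printed by explain shows a combine stage, that is a sign the
combiner will be called.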

Olga

> -----Original Message-----
> From: charles du [EMAIL PROTECTED] 
> Sent: Thursday, August 07, 2008 2:55 PM
> To: pig-user@incubator.apache.org
> Subject: Re: how to get the size of a data bag
> 
> Thanks. It works.
> 
> My concern right now is the performance. For example, I have
> 2 million records that belong to two types. If I want to
> count the number of records for each type, I need to group
> records based on the type as follows:
> 
> A = LOAD <my file> as (type, ...);
> B = GROUP  A  BY  type;
> C = foreach A generate COUNT(A);
> 
> I noticed it usually takes hadoop a long time to get the
> results back. My experience with hadoop is that if there are
> a large number of values for a key, hadoop is very slow in
> the reduce function. I understand it is more of a hadoop
> problem than a pig one. Do you guys know any way to
> speed up the calculation?
> 
> 
> Thanks.
> 
> Chuang
> 
> On Fri, Jul 18, 2008 at 2:10 PM, Olga Natkovich 
> <[EMAIL PROTECTED]> wrote:
> 
> > How was your bag created?
> >
> > Normally, you would load the data, then group it into a bag using
> > group by or group all, and then apply the count:
> >
> > A = load 'input';
> > B = group A all;
> > C = foreach B generate COUNT(A);
> >
> > Olga
> >
> > > -----Original Message-----
> > > From: charles du [EMAIL PROTECTED]
> > > Sent: Friday, July 18, 2008 12:23 PM
> > > To: pig-user@incubator.apache.org
> > > Subject: how to get the size of a data bag
> > >
> > > Hi:
> > >
> > > Just started learning hadoop and pig latin. How can I get the
> > > number of elements in a data bag?
> > >
> > > For example, a data bag like the following has four elements.
> > >   B= {1, 2, 3, 5}
> > >
> > > I tried C = COUNT(B), it did not work. Thanks.
> > >
> > > --
> > > tp
> > >
> >
> 
> 
> 
> --
> tp
>