charles du
Thu, 07 Aug 2008 14:55:50 -0700
Thanks. It works.
My concern right now is the performance. For example, I have 2 million
records that belong to two types. If I want to count the number of records
for each type, I need to group the records by type as follows:
A = LOAD <my file> as (type, ...);
B = GROUP A BY type;
C = foreach B generate group, COUNT(A);
I noticed that it usually takes Hadoop a long time to return the results. My
experience with Hadoop is that if there are a large number of values for a
key, the reduce function is very slow. I understand this is more of a
Hadoop problem than a Pig one. Do you know of any way to speed up the
calculation?
Thanks.
Chuang
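
One general way to ease a reducer that receives a very large number of values for a single key is to aggregate partially on the map side before the shuffle, which is the role of a combiner in Hadoop. The sketch below is plain Python, not Pig or Hadoop code. The function names and the sample records are hypothetical, and whether a given Pig release applies this optimization automatically to COUNT is an assumption to verify. It only illustrates why summing per-split partial counts leaves each reducer with a handful of numbers instead of millions of records.

```python
from collections import Counter
from typing import Iterable, Tuple


def map_side_partial_counts(records: Iterable[Tuple]) -> Counter:
    """Combiner-style step: count records per type within one input split."""
    return Counter(record[0] for record in records)


def reduce_partial_counts(partials: Iterable[Counter]) -> Counter:
    """Reducer step: sum the per-split partial counts for each type."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts, it does not replace them
    return total


# Hypothetical data: two input splits holding records of two types, "a" and "b".
splits = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("b", 4), ("b", 5)],
]

partials = [map_side_partial_counts(split) for split in splits]
totals = reduce_partial_counts(partials)

for record_type, count in sorted(totals.items()):
    print(record_type, count)
```

With this shape, the reduce step receives one small counter per split, so its cost grows with the number of splits rather than with the number of records per type.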
On Fri, Jul 18, 2008 at 2:10 PM, Olga Natkovich <[EMAIL PROTECTED]> wrote:
> How was your bag created?
>
> Normally, you would load the data then group it into a bag using group
> by or group all and then apply the count:
>
> A = load 'input';
> B = group A all;
> C = foreach B generate COUNT(A);
>
> Olga
>
> > -----Original Message-----
> > From: charles du [EMAIL PROTECTED]
> > Sent: Friday, July 18, 2008 12:23 PM
> > To: pig-user@incubator.apache.org
> > Subject: how to get the size of a data bag
> >
> > Hi:
> >
> > Just started learning Hadoop and Pig Latin. How can I get the
> > number of elements in a data bag?
> >
> > For example, a data bag like the following has four elements:
> > B = {1, 2, 3, 5}
> >
> > I tried C = COUNT(B), but it did not work. Thanks.
> >
> > --
> > tp
> >
>
--
tp