I guess I'd take back some of my thoughts, considering that Pig is specifically
designed to produce M/R jobs. Command line parameters and those specified by
%declare won't change their values during the life of the whole job (which may
consist of multiple M/R tasks). Variables that can change value from time to
time do not fit the M/R scheme, which suits applications in which data, once
created, are usually read-only. One suggestion, though, would be to allow
variables that can be assigned a value only once and that carry that value
from the point of assignment to the end of the program; that is, once a
variable is assigned a value, it becomes immutable. Of course, even this would
create some difficulty, e.g. for optimization, since it may add extra data
dependencies ...
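
To illustrate the point about %declare, here is a minimal sketch of a
parameter that stays fixed for the whole script. The name TOTAL, its value,
the path 'sth', and the typed schema are assumptions made up for illustration;
they are not from this thread.

```pig
-- Sketch only: TOTAL, 'sth' and the field names are illustrative.
-- TOTAL is filled in by text substitution before the script runs,
-- so it has the same value in every M/R job the script produces.
%declare TOTAL 1000.0

a = LOAD 'sth' AS (key:chararray, value:double);
d = FOREACH a GENERATE key, value / $TOTAL AS share;
```

The same value could instead be supplied on the command line with
-param TOTAL=1000.0. Either way it has to be known before the script starts,
which is why it cannot hold a count computed inside the same script.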


Michael

--- On Wed, 2/24/10, jiang licht <[email protected]> wrote:

From: jiang licht <[email protected]>
Subject: Re: count total number of tuples in a bag?
To: [email protected]
Date: Wednesday, February 24, 2010, 12:17 AM

If there are handy variables to carry values here and there, that'd be helpful 
:)

Thanks,

Michael

--- On Tue, 2/23/10, Jeff Zhang <[email protected]> wrote:

From: Jeff Zhang <[email protected]>
Subject: Re: count total number of tuples in a bag?
To: [email protected]
Date: Tuesday, February 23, 2010, 8:32 PM

One way I can think of is to store the total number of tuples in one
specified place, and then load it in your UDF when you want to use it.

a_all = group a ALL;
a_count = FOREACH a_all GENERATE COUNT(a);
store a_count into 'your_store_place';
.....................

d = foreach c generate YourUDF($0);
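
A possible alternative that avoids the UDF and the intermediate store is to
CROSS the one-row count relation with the data, so the total travels with
every tuple. This is only a sketch: the typed schema, the field name total,
and the weight calculation are assumptions added for illustration.

```pig
-- Sketch only: 'sth', the schema and the alias names are illustrative.
a = LOAD 'sth' AS (key:chararray, value:double);

-- Collapse the whole relation into one group and count its tuples.
a_all = GROUP a ALL;
a_count = FOREACH a_all GENERATE COUNT(a) AS total;

-- a_count holds a single tuple, so CROSS attaches the total to each row of a.
a_with_total = CROSS a, a_count;

-- The total can now be used like a constant in per-tuple expressions.
d = FOREACH a_with_total GENERATE a::key, a::value,
        1.0 / (double) a_count::total AS weight;
```

Because a_count has only one row, the CROSS does not multiply the number of
tuples in a, though it still adds an extra step to the job plan.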




On Tue, Feb 23, 2010 at 4:28 PM, jiang licht <[email protected]> wrote:

> Thanks Dmitriy. That's not quite what I want. I want something like in SQL,
> where you can get a total count of tuples (or other things of interest)
> and use it like a variable (sorry, I don't know if I should say "variable"
> here in Pig, but Pig passes command line parameters as variables, right?).
> Such a variable would be convenient for quick calculation of statistics in
> Pig scripts. Though I also realize it might not be right to use a variable
> in this way in Pig. So, it might be a misconception on my part anyway...
>
> Thanks,
>
> Michael
>
> --- On Tue, 2/23/10, Dmitriy Ryaboy <[email protected]> wrote:
>
> From: Dmitriy Ryaboy <[email protected]>
> Subject: Re: count total number of tuples in a bag?
> To: [email protected]
> Date: Tuesday, February 23, 2010, 6:10 PM
>
> c = FOREACH b GENERATE group as key, COUNT(a);
>
> will give you the number of rows in a per key.
>
> a_all = group a ALL;
> a_count = FOREACH a_all GENERATE COUNT(a);
>
> will give you the total number of rows in a.
>
> Does that answer your question?
>
>
> On Tue, Feb 23, 2010 at 3:54 PM, jiang licht <[email protected]> wrote:
>
> > Excuse me, I may have missed an important part of the Pig documentation
> > in asking this trivial question :) What is the best way to find out the
> > total number of tuples (rows) in a bag of loaded data? For example, after
> > "a = LOAD 'sth' AS (key, value); b = GROUP a BY key; c = FOREACH b
> > GENERATE key;" I want to know how many tuples are loaded into 'a' and how
> > many are left in 'c'. One way might be to use a UDF. But is there built-in
> > support for counting this in Pig?
> >
> > Thanks,
> >
> > Michael



-- 
Best Regards

Jeff Zhang