pig-user  

Re: more bagging fun

hc busy
Wed, 10 Mar 2010 10:15:39 -0800

An additional thought... we could define UDFs like

ADD(bag{(int,int)}), DIVIDE(bag{(int,int)}), MULTIPLY(bag{(int,int)}),
SQRT(bag{(float)})...

basically vectorizing most of the common arithmetic operations. But then the
language would have to support it by converting

bag.a + bag.b

to

ADD(bag.(a,b))
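
A rough sketch of what such a vectorized UDF would compute, written in Python rather than as a real Pig UDF. A bag is modeled as a plain list of tuples, and the name add_bag is made up for illustration; none of this is the Pig API.

```python
# Assumption: a bag is modeled as a list of tuples; this is not the Pig API.
def add_bag(bag):
    """Hypothetical vectorized ADD: one output tuple per input (a, b) tuple."""
    return [(a + b,) for (a, b) in bag]


pairs = [(1, 2), (3, 4), (5, 6)]
print(add_bag(pairs))  # [(3,), (7,), (11,)]
```

The output is again a bag with one single-field tuple per input tuple, so an aggregate such as SUM could consume it directly.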

I guess there are some difficulties, for instance:

SQRT(bag.a)+bag.b

How would this work? Because SQRT(bag.a) returns a bag, how would we convert
it to the correct per-tuple operation? It's almost like we want to convert
an expression

ADD(SQRT(bag.a),bag.b)

into a function F such that

ADD(SQRT(bag.a),bag.b) = F(bag.a,bag.b)

and then F is computed by iterating over each tuple of the bag.

FOREACH ... GENERATE ..., F(bag.(a,b));
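
To make the idea of F concrete, here is a minimal Python sketch (again not Pig code). It assumes a bag is a list of tuples; lift and divide_sum are hypothetical helper names. The scalar expression sqrt(a) + b is written once per tuple, and lift turns it into a function over the whole bag.

```python
import math


def lift(per_tuple_fn):
    """Turn a function of one tuple's fields into a function of a whole bag."""
    def bag_fn(bag):
        # Apply the scalar expression to each tuple; emit single-field tuples.
        return [(per_tuple_fn(*fields),) for fields in bag]
    return bag_fn


# F plays the role of SQRT(bag.a) + bag.b, evaluated tuple by tuple.
F = lift(lambda a, b: math.sqrt(a) + b)


def divide_sum(bag):
    """The DIVIDE_SUM idea from the quoted message, built on the same pattern."""
    return sum(v for (v,) in lift(lambda n, d: n / d)(bag))


bag = [(4, 1), (9, 2), (16, 3)]
print(F(bag))                      # [(3.0,), (5.0,), (7.0,)]
print(divide_sum([(1, 2), (3, 4)]))  # 1.25
```

The point of the sketch is that the intermediate bag from SQRT never has to be paired back up with bag.b: the whole expression is evaluated inside a single pass over the tuples.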

On Wed, Mar 10, 2010 at 9:31 AM, hc busy <hc.b...@gmail.com> wrote:

>
> So, pig team, what is the right way to accomplish this?
>
>
> On Tue, Mar 9, 2010 at 10:50 PM, Mridul Muralidharan <
> mrid...@yahoo-inc.com> wrote:
>
>> On Tuesday 09 March 2010 04:13 AM, hc busy wrote:
>>
>>> okay. Here's the bag that I have:
>>>
>>>  {group: (a: int,b: chararray,c: chararray,d: int), TABLE: {number1: int,
>>> number2:int}}
>>>
>>>
>>>
>>> and I want to do this
>>>
>>> grunt>  CALCULATE = FOREACH TABLE_group GENERATE group,
>>>     SUM(TABLE.number1 / TABLE.number2);
>>>
>>
>>
>> TABLE.number1 actually gives you the bag of number1 values found in TABLE,
>> but I am never really sure of the semantics in these situations, since I am
>> slightly nervous that they are implementation-dependent. My understanding is
>> that what you are attempting should not work, but I could be wrong.
>>
>> I do know that TABLE.(number1, number2) will consistently project and pair
>> up the fields, so to 'fix' this you can write your own DIVIDE_SUM and use it
>> something like this:
>>
>> grunt>  CALCULATE= FOREACH TABLE_group GENERATE group,
>> DIVIDE_SUM(TABLE.(number1 , number2));
>>
>> The DIVIDE_SUM UDF implementation takes in a bag of tuples with schema
>> (numerator, denominator) and returns:
>>
>> result == sum ( foreach tuple ( tuple.numerator / tuple.denominator ) );
>>
>>
>> Obviously, this is not as 'elegant' as your initial code and is definitely
>> more cumbersome, so clarifying this behavior with someone from the Pig team
>> before you attempt this would definitely be better.
>>
>>
>> Regards,
>> Mridul
>>
>>
>>
>>> grunt>  DUMP CALCULATE;
>>>
>>> 2010-03-08 14:02:41,055 [main] ERROR org.apache.pig.tools.grunt.Grunt -
>>> ERROR 1039: Incompatible types in Multiplication Operator left hand side:bag
>>> right hand side:bag
>>>
>>>
>>>
>>> This seems useful: I may want to calculate an aggregate of some arithmetic
>>> operation on the members of a bag. Any suggestions?
>>>
>>> ... Looking at the documentation it looks like I want to do something
>>> like
>>>
>>> SUM(TABLE.(number1 / number2))
>>>
>>> but that doesn't work either :-(
>>>
>>
>>
>