Re: UDF with two Bag one per group and one 'static'

Dmitriy Ryaboy Fri, 30 Apr 2010 09:55:06 -0700

I don't think there's a need to reinvent, or reimplement, the wheel here.

You are just talking about templates. Try http://template-toolkit.org/
(or any of the Ruby / Python variants on the theme).

Or the Ruby Oink DSL.

-D

On Fri, Apr 30, 2010 at 9:45 AM, hc busy wrote:
> Sometimes, I find it necessary to project before performing the group by.
> Because there isn't support for functions or #defines, it's not possible to
> pass in which column to group by, except by projecting before grouping.
>
> A = LOAD 'a' AS (group, value);
> B = LOAD 'b';
> B2 = foreach B generate $5 as group, *;
> G = GROUP A BY group, B2 BY group;
> R = FOREACH G GENERATE FLATTEN(my.udf(A,B2));
>
> Wouldn't introducing #define in Pig speed this up? Adding a preprocessor,
> similar to the parameter substitution, to support basic #define would be
> cool.
>
> #define JordiGroup(t1, t2, f1, f2){
>           G = group t1 by f1, t2 by f2;
>           FOREACH G GENERATE FLATTEN(my.udf(t1,t2));
>
> }
>
> ... and later on
>
> R = JordiGroup(A, B, group, $5);
>
> Where the result of the #define is its last line. The implementation would
> have a really simple parser to ensure that (), [] and {} match for blocks
> starting with '#define'. It then performs substitution in the order the
> macros appear; no recursion is allowed.
>
>
>
>
> On Fri, Apr 30, 2010 at 8:51 AM, Alan Gates wrote:
>
>> You need to change your group to a cogroup so that both bags are in your
>> data stream.  If you don't want to group bag b by the same keys as a (that
>> is, you want all of b available for each group of a) then you can load b as
>> a side file inside your udf.
>>
>> Alan.
>>
>>
>> On Apr 30, 2010, at 4:32 AM, Jordi Deu-Pons wrote:
>>
>>> Hi,
>>>
>>> I've developed a UDF that receives two bags as inputs and outputs one
>>> bag.
>>>
>>> One of the bags is different in every group and the other is always the
>>> same.
>>>
>>> Example code:
>>>
>>> A = LOAD 'a' AS (group, value);
>>> B = LOAD 'b';
>>> G = GROUP A BY group;
>>> R = FOREACH G GENERATE FLATTEN(my.udf(A,B));
>>>
>>> This gives an error "Error during parsing. Invalid alias: B".
>>> I can understand this error, but I cannot see another
>>> way to do this.
>>>
>>> Do you know which is the best way to do this?
>>>
>>> Thanks
>>>
>>> --
>>> a10! i fins aviat.
>>> J:-Deu
>>>
>>
>>
>
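
For reference, the #define proposal quoted above (match brackets in each
'#define' block, substitute in the order the macros appear, no recursion)
can be sketched as a small standalone preprocessor. This is only a minimal
illustration in Python, not anything Pig ships: the function names
(find_block_end, extract_macros, expand) are hypothetical, statements are
split naively on ';', and call arguments may not contain parentheses.

```python
import re

# "#define Name(p1, p2, ...){" ; the match ends on the opening brace.
DEFINE_RE = re.compile(r"#define\s+(\w+)\s*\(([^)]*)\)\s*\{")
CLOSERS = {"(": ")", "[": "]", "{": "}"}


def find_block_end(text, open_pos):
    """Return the index of the '}' matching the '{' at open_pos.

    Raises ValueError if (), [] or {} do not nest properly in the block.
    """
    stack = []
    for i in range(open_pos, len(text)):
        ch = text[i]
        if ch in CLOSERS:
            stack.append(CLOSERS[ch])
        elif ch in CLOSERS.values():
            if not stack or stack.pop() != ch:
                raise ValueError(f"unbalanced {ch!r} at offset {i}")
            if not stack:
                return i
    raise ValueError("unterminated #define block")


def extract_macros(script):
    """Return ({name: (params, statements)}, script without the #defines)."""
    macros, rest, pos = {}, [], 0
    while True:
        m = DEFINE_RE.search(script, pos)
        if m is None:
            rest.append(script[pos:])
            return macros, "".join(rest)
        rest.append(script[pos:m.start()])
        end = find_block_end(script, m.end() - 1)
        params = [p.strip() for p in m.group(2).split(",") if p.strip()]
        body = [s.strip() for s in script[m.end():end].split(";") if s.strip()]
        if not body:
            raise ValueError(f"macro {m.group(1)} has an empty body")
        macros[m.group(1)] = (params, body)
        pos = end + 1


def expand(script):
    """Expand 'X = Name(args);' calls, one pass per macro, in definition order."""
    macros, text = extract_macros(script)
    for name, (params, body) in macros.items():
        call_re = re.compile(
            r"(\w+)\s*=\s*" + re.escape(name) + r"\s*\(([^)]*)\)\s*;"
        )

        def replace_call(call, name=name, params=params, body=body):
            args = [a.strip() for a in call.group(2).split(",") if a.strip()]
            if len(args) != len(params):
                raise ValueError(f"{name} expects {len(params)} arguments")
            lines = list(body)
            if params:
                binding = dict(zip(params, args))
                # One regex pass, so substituted arguments are never rescanned.
                param_re = re.compile(
                    r"\b(" + "|".join(map(re.escape, params)) + r")\b"
                )
                lines = [param_re.sub(lambda p: binding[p.group(1)], s)
                         for s in lines]
            # The last statement of the macro is its result.
            lines[-1] = f"{call.group(1)} = {lines[-1]}"
            return "\n".join(line + ";" for line in lines)

        text = call_re.sub(replace_call, text)
    return text


SCRIPT = """#define JordiGroup(t1, t2, f1, f2){
    G = group t1 by f1, t2 by f2;
    FOREACH G GENERATE FLATTEN(my.udf(t1,t2));
}
A = LOAD 'a' AS (group, value);
B = LOAD 'b';
R = JordiGroup(A, B, group, $5);
"""

if __name__ == "__main__":
    print(expand(SCRIPT))
```

Run on the JordiGroup example from the thread, the call line is replaced by
the two statements of the macro body with A, B, group and $5 substituted in,
and the final statement assigned to R. Note that the intermediate alias G is
not renamed, so two calls to the same macro would both define G; a real
implementation would need to generate unique aliases.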
