Sometimes I find it necessary to project before performing the group by.
Because Pig has no support for functions or #define macros, there is no
way to pass in which column to group by; the only workaround is to project
before grouping.

A = LOAD 'a' AS (group, value);
B = LOAD 'b';
B2 = foreach B generate $5 as group, *;
G = GROUP A BY group, B2 BY group;
R = FOREACH G GENERATE FLATTEN(my.udf(A,B2));

Wouldn't introducing #define in Pig speed this up? Adding a preprocessor,
similar to the parameter substitution, to support a basic #define would be
cool.

#define JordiGroup(t1, t2, f1, f2) {
    G = GROUP t1 BY f1, t2 BY f2;
    FOREACH G GENERATE FLATTEN(my.udf(t1,t2));
}

... and later on

R = JordiGroup(A, B, group, $5);

The result of the #define is its last line. The implementation would need
only a really simple parser to ensure that (), [] and {} match within
blocks starting with '#define'. It would then perform substitution in the
order the macros appear; no recursion is allowed.
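
To make the idea concrete, here is a rough sketch of such a preprocessor in
Python. It is only an illustration under my own assumptions, not anything
that exists in Pig: the names find_block_end, extract_macros and expand are
made up, statements are split naively on ';', arguments are split naively on
',', and a call is only recognized in the form "ALIAS = Name(args);".

```python
import re

# Hypothetical syntax: "#define Name(p1, p2) { stmt; stmt; }"
DEFINE_RE = re.compile(r'#define\s+(\w+)\s*\(([^)]*)\)\s*\{')


def find_block_end(text, open_idx):
    """Return the index of the '}' closing the '{' at open_idx.

    Raises ValueError if (), [] or {} do not match inside the block.
    """
    pairs = {')': '(', ']': '[', '}': '{'}
    stack = []
    for i in range(open_idx, len(text)):
        ch = text[i]
        if ch in '([{':
            stack.append(ch)
        elif ch in pairs:
            if not stack or stack.pop() != pairs[ch]:
                raise ValueError('mismatched %r at offset %d' % (ch, i))
            if not stack:
                return i
    raise ValueError('unterminated #define block')


def extract_macros(source):
    """Strip #define blocks; return (macros in definition order, rest)."""
    macros, rest, pos = {}, [], 0
    while True:
        m = DEFINE_RE.search(source, pos)
        if m is None:
            rest.append(source[pos:])
            return macros, ''.join(rest)
        rest.append(source[pos:m.start()])
        end = find_block_end(source, m.end() - 1)
        params = [p.strip() for p in m.group(2).split(',') if p.strip()]
        body = source[m.end():end]
        # Naive split: assumes no ';' inside string literals.
        statements = [s.strip() for s in body.split(';') if s.strip()]
        if not statements:
            raise ValueError('empty macro body: ' + m.group(1))
        macros[m.group(1)] = (params, statements)
        pos = end + 1


def expand(source):
    """Expand each macro once, in definition order, without recursion."""
    macros, text = extract_macros(source)
    for name, (params, statements) in macros.items():
        call = re.compile(
            r'(\w+)\s*=\s*' + re.escape(name) + r'\s*\(([^)]*)\)\s*;')

        def replace(m, params=params, statements=statements):
            args = [a.strip() for a in m.group(2).split(',') if a.strip()]
            if len(args) != len(params):
                raise ValueError('wrong number of arguments for ' + name)
            lines = list(statements)
            if params:
                mapping = dict(zip(params, args))
                # One pass over each statement, so substituted text is
                # never substituted again.
                pat = re.compile(
                    r'\b(' + '|'.join(map(re.escape, params)) + r')\b')
                lines = [pat.sub(lambda p: mapping[p.group(1)], s)
                         for s in lines]
            # The last statement is the macro's result: bind it to the alias.
            lines[-1] = m.group(1) + ' = ' + lines[-1]
            return '\n'.join(line + ';' for line in lines)

        text = call.sub(replace, text)
    return text
```

Fed the JordiGroup definition and the call above, expand() should replace
the call with the GROUP statement followed by "R = FOREACH G GENERATE ...",
with t1, t2, f1 and f2 replaced by A, B, group and $5.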




On Fri, Apr 30, 2010 at 8:51 AM, Alan Gates <[email protected]> wrote:

> You need to change your group to a cogroup so that both bags are in your
> data stream.  If you don't want to group bag b by the same keys as a (that
> is, you want all of b available for each group of a) then you can load b as
> a side file inside your udf.
>
> Alan.
>
>
> On Apr 30, 2010, at 4:32 AM, Jordi Deu-Pons wrote:
>
>> Hi,
>>
>> I've developed a UDF that receives two bags as inputs and outputs one
>> bag.
>>
>> One of the bags is different in every group and the other is always the
>> same.
>>
>> Example code:
>>
>> A = LOAD 'a' AS (group, value);
>> B = LOAD 'b';
>> G = GROUP A BY group;
>> R = FOREACH G GENERATE FLATTEN(my.udf(A,B));
>>
>> This gives an error: "Error during parsing. Invalid alias: B".
>> I can understand this error, but I cannot figure out another
>> way to do this.
>>
>> Do you know the best way to do this?
>>
>> Thanks
>>
>> --
>> a10! i fins aviat.
>> J:-Deu
>>
>
>
