Re: UDF with two Bag one per group and one 'static'

hc busy Fri, 30 Apr 2010 10:51:44 -0700

But we don't want to extend PigLatin to have #define... ?

On Fri, Apr 30, 2010 at 10:04 AM, Dmitriy Ryaboy <[email protected]> wrote:


> http://www.stringtemplate.org/
>
> On Fri, Apr 30, 2010 at 9:57 AM, hc busy <[email protected]> wrote:
> > Is there a Java preprocessor?
> >
> > On Fri, Apr 30, 2010 at 9:54 AM, Dmitriy Ryaboy <[email protected]> wrote:
> >
> >> I don't think there's a need to reinvent, or reimplement, the wheel here.
> >>
> >> You are just talking about templates. Try http://template-toolkit.org/
> >> (or any of the ruby / python variants on the theme).
> >>
> >> Or the ruby Oink DSL.
> >>
> >> -D
> >>
> >> On Fri, Apr 30, 2010 at 9:45 AM, hc busy <[email protected]> wrote:
> >> > Sometimes, I find it necessary to project before performing the group by.
> >> > Because there is no support for functions or #defines, it is not possible
> >> > to pass in which column to group by, except by projecting before grouping.
> >> >
> >> > A = LOAD 'a' AS (group, value);
> >> > B = LOAD 'b';
> >> > B2 = foreach B generate $5 as group, *;
> >> > G = GROUP A BY group, B2 BY group;
> >> > R = FOREACH G GENERATE FLATTEN(my.udf(A,B2));
> >> >
> >> > Wouldn't introducing #define in Pig speed this up? Adding a preprocessor,
> >> > similar to parameter substitution, to support a basic #define would be
> >> > cool.
> >> >
> >> > #define JordiGroup(t1, t2, f1, f2){
> >> >           G = group t1 by f1, t2 by f2;
> >> >           FOREACH G GENERATE FLATTEN(my.udf(t1,t2));
> >> >
> >> > }
> >> >
> >> > ... and later on
> >> >
> >> > R = JordiGroup(A, B, group, $5);
> >> >
> >> > Here the result of the #define is its last line. The implementation
> >> > would have a really simple parser to ensure that (), [] and {} match in
> >> > blocks starting with '#define'. It then performs substitution in the order
> >> > the macros appear; no recursion is allowed.
> >> >
> >> >
> >> >
> >> >
> >> > On Fri, Apr 30, 2010 at 8:51 AM, Alan Gates <[email protected]> wrote:
> >> >
> >> >> You need to change your group to a cogroup so that both bags are in your
> >> >> data stream.  If you don't want to group bag b by the same keys as a (that
> >> >> is, you want all of b available for each group of a) then you can load b
> >> >> as a side file inside your udf.
> >> >>
> >> >> Alan.
> >> >>
> >> >>
> >> >> On Apr 30, 2010, at 4:32 AM, Jordi Deu-Pons wrote:
> >> >>
> >> >>  Hi,
> >> >>>
> >> >>> I've developed a UDF that receives two bags as inputs and outputs one
> >> >>> bag.
> >> >>>
> >> >>> One of the bags is different in every group and the other is always the
> >> >>> same.
> >> >>>
> >> >>> Example code:
> >> >>>
> >> >>> A = LOAD 'a' AS (group, value);
> >> >>> B = LOAD 'b';
> >> >>> G = GROUP A BY group;
> >> >>> R = FOREACH G GENERATE FLATTEN(my.udf(A,B));
> >> >>>
> >> >>> This gives the error "Error during parsing. Invalid alias: B".
> >> >>> I can understand this error, but I cannot figure out another
> >> >>> way to do this.
> >> >>>
> >> >>> Do you know the best way to do this?
> >> >>>
> >> >>> Thanks
> >> >>>
> >> >>> --
> >> >>> a10! i fins aviat.
> >> >>> J:-Deu
> >> >>>
> >> >>
> >> >>
> >> >
> >>
> >
>
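
For what it's worth, the preprocessor described in the quoted proposal (match brackets inside each '#define' block, substitute macros in the order they are defined, no recursion, last statement is the result) can be sketched in a few lines of Python. This is only an illustration under assumptions: the '#define' syntax is the one proposed in this thread and is not something Pig itself parses, and the function names (find_block_end, extract_macros, expand_macros) are made up for the example. It splits statements naively on ';' and does not handle nested parentheses in call arguments or quoted strings.

```python
import re

# Header of a proposed macro block: "#define Name(p1, p2){"
DEFINE_RE = re.compile(r"#define\s+(\w+)\s*\(([^)]*)\)\s*\{")
CLOSERS = {")": "(", "]": "[", "}": "{"}


def find_block_end(text, start):
    """Return the index of the '}' closing the '{' at text[start],
    checking that (), [] and {} nest correctly along the way."""
    stack = []
    for i in range(start, len(text)):
        ch = text[i]
        if ch in "([{":
            stack.append(ch)
        elif ch in CLOSERS:
            if not stack or stack.pop() != CLOSERS[ch]:
                raise ValueError(f"unbalanced {ch!r} at offset {i}")
            if not stack:
                return i
    raise ValueError("unterminated #define block")


def extract_macros(script):
    """Split a script into (macros, remaining_text).
    macros maps name -> (param_names, body_statements), in definition order."""
    macros, rest, pos = {}, [], 0
    while True:
        m = DEFINE_RE.search(script, pos)
        if m is None:
            rest.append(script[pos:])
            return macros, "".join(rest)
        rest.append(script[pos:m.start()])
        end = find_block_end(script, m.end() - 1)  # m.end() - 1 is the '{'
        params = [p.strip() for p in m.group(2).split(",") if p.strip()]
        # Naive statement split: assumes no ';' inside quoted strings.
        body = [s.strip() for s in script[m.end():end].split(";") if s.strip()]
        macros[m.group(1)] = (params, body)
        pos = end + 1


def substitute(stmt, bind):
    """Replace whole-word parameter names in one pass, so that substituted
    text is never substituted again."""
    if not bind:
        return stmt
    pattern = r"\b(" + "|".join(map(re.escape, bind)) + r")\b"
    return re.sub(pattern, lambda m: bind[m.group(1)], stmt)


def expand_macros(script):
    """Expand 'alias = Name(args);' calls, one pass per macro, no recursion."""
    macros, text = extract_macros(script)
    for name, (params, body) in macros.items():
        call_re = re.compile(
            r"(\w+)\s*=\s*" + re.escape(name) + r"\s*\(([^)]*)\)\s*;")

        def expand_call(call):
            args = [a.strip() for a in call.group(2).split(",") if a.strip()]
            if len(args) != len(params):
                raise ValueError(f"{name} expects {len(params)} arguments")
            bind = dict(zip(params, args))
            lines = [substitute(s, bind) for s in body]
            # The last statement of the body is the macro's result.
            lines[-1] = f"{call.group(1)} = {lines[-1]}"
            return "\n".join(line + ";" for line in lines)

        text = call_re.sub(expand_call, text)
    return text


EXAMPLE = """
#define JordiGroup(t1, t2, f1, f2){
    G = group t1 by f1, t2 by f2;
    FOREACH G GENERATE FLATTEN(my.udf(t1,t2));
}
A = LOAD 'a' AS (group, value);
B = LOAD 'b';
R = JordiGroup(A, B, group, $5);
"""

if __name__ == "__main__":
    print(expand_macros(EXAMPLE))
```

Run on the example from the thread, the call to JordiGroup is replaced by a "G = group A by group, B by $5;" statement followed by the FOREACH statement assigned to R. Note that the intermediate alias G is not renamed, so calling the same macro twice would reuse it; a real implementation would probably want to generate unique aliases.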
