This is exactly what I need, thanks! I had checked piggybank previously, but
didn't catch these the first time.
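
For anyone following the thread, here is a minimal sketch of the regexp approach suggested below. It assumes a later Pig release where REGEX_EXTRACT is available as a builtin (an assumption: this thread only mentions the piggybank functions, whose usage may differ). B and body come from the original question; the pattern is only illustrative.

```pig
-- Assumption: REGEX_EXTRACT(string, regex, index) is available as a builtin.
-- B: {body: chararray}, e.g. (2009-11-18 09:32:43,000 color=blue)
-- Capture group 1 is the leading yyyy-MM-dd date; the trailing .* lets the
-- pattern cover the whole line.
D = FOREACH B GENERATE
        REGEX_EXTRACT(body, '^([0-9]{4}-[0-9]{2}-[0-9]{2}) .*', 1) AS date;
DUMP D;
```

This sidesteps the bag entirely. In the original script, B1.$0 projects the first field of every tuple in the bag, and since each token tuple has only one field, the result is the whole bag of tokens rather than just the first token.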

On Wed, Nov 18, 2009 at 1:15 PM, zaki rahaman <[email protected]> wrote:

> Hey Bill,
>
> If you look in piggybank (http://wiki.apache.org/pig/PiggyBank), in the
> contrib dir of your pig installation, you'll find several functions that
> might help. I haven't used any myself, but in
> org.apache.pig.piggybank.storage you'll find RegExLoader and MyRegExLoader.
> If you pass a regexp with capturing groups, I believe you can simply use
> these functions directly. There are also Apache-log-specific load funcs; I
> think there are Common and Combined Log loaders. Simply set up your scripts
> to use those functions to load your input data and I believe you'll have
> what you need.
>
>
>
> On Wed, Nov 18, 2009 at 4:03 PM, Mridul Muralidharan
> <[email protected]> wrote:
>
> >
> >
> > You are right, there is no ordering of tuples within a bag by default
> > (except in some cases - like output of ORDER BY).
> >
> > For the specific purpose of pulling the date field - you could just use
> > some regexp udf instead of tokenize to pick the value you are interested
> > in.
> >
> > There should be udf's in piggy bank which do this ...
> >
> >
> >
> > Or is this a more general question regarding accessing tuples within a
> > bag in some ordered fashion?
> >
> >
> > Regards,
> > Mridul
> >
> >
> >
> >
> > Bill Graham wrote:
> >
> >> Hi,
> >>
> >> I'm struggling to get the tokens out of a bag of tuples created by the
> >> TOKENIZE UDF and could use some help. I want to tokenize and then be
> >> able to reference the tokens by their position. Is this even possible?
> >> Since the token count is non-deterministic, I'm questioning whether I
> >> can use positional parameters to dig them out.
> >>
> >> Anyway, here's what I'm doing, starting with a chararray where each:
> >>
> >> grunt> describe B;
> >> B: {body: chararray}
> >> grunt> dump B;
> >> (2009-11-18 09:32:43,000 color=blue)
> >> (2009-11-18 09:32:43,000 color=red)
> >> (2009-11-18 09:32:44,000 color=red)
> >> (2009-11-18 09:32:45,000 color=green)
> >>
> >> grunt> C = FOREACH B GENERATE TOKENIZE((chararray)body) as
> >> B1:bag{T1:tuple(T:chararray)};
> >> grunt> describe C;
> >> C: {B1: {T1: (T: chararray)}}
> >>
> >> grunt> D = FOREACH C GENERATE B1.$0 as date;
> >> grunt> describe D;
> >> D: {date: {T: chararray}}
> >>
> >> grunt> dump D;
> >> ...
> >> ({(2009-11-18),(09:32:43),(000),(color=blue)})
> >> ({(2009-11-18),(09:32:43),(000),(color=red)})
> >> ({(2009-11-18),(09:32:44),(000),(color=red)})
> >> ({(2009-11-18),(09:32:45),(000),(color=green)})
> >>
> >> What I'd expect to see is just the date values.
> >>
> >> Any ideas?
> >>
> >> thanks,
> >> Bill
> >>
> >
> >
>
>
> --
> Zaki Rahaman
>
