This is exactly what I need, thanks! I had checked piggybank previously, but didn't catch these the first time.
On Wed, Nov 18, 2009 at 1:15 PM, zaki rahaman <[email protected]> wrote:

> Hey Bill,
>
> If you look in piggybank (http://wiki.apache.org/pig/PiggyBank look in) in
> the contrib dir of your pig installation, you'll find several functions that
> might help. I haven't used any myself, but in
> org.apache.pig.piggybank.storage you'll find RegExLoader and MyRegExLoader.
> If you pass a reg exp with capturing groups I believe you can simply use
> these functions directly. There are also apache log specific load funcs, I
> think theres Common and Combined Log Loaders... simply set up your scripts
> to use those functions to load your input data and you'll have what you need
> I believe.
>
> On Wed, Nov 18, 2009 at 4:03 PM, Mridul Muralidharan
> <[email protected]> wrote:
>
> > You are right, there is no ordering of tuples within a bag by default
> > (except in some cases - like output of ORDER BY).
> >
> > For the specific purpose of pulling the date field - you could just use
> > some regexp udf instead of tokenize to pick the value you are interested in.
> >
> > There should be udf's in piggy bank which do this ...
> >
> > Or is this a more general question regarding accessing tuples within a bag
> > in some ordered fashion ?
> >
> > Regards,
> > Mridul
> >
> > Bill Graham wrote:
> >
> >> Hi,
> >>
> >> I'm struggling to get the tokens out of a bag of tuples created by the
> >> TOKENIZE UDF and could use some help. I want to tokenize and then be able to
> >> reference the tokens by their position. Is this even possible? Since the
> >> token count is non-deterministic, I'm question whether I can use positional
> >> parameters to dig them out.
> >>
> >> Anyway, here's what I'm doing, starting with a chararray where each:
> >>
> >> grunt> describe B;
> >> B: {body: chararray}
> >> grunt> dump B;
> >> (2009-11-18 09:32:43,000 color=blue)
> >> (2009-11-18 09:32:43,000 color=red)
> >> (2009-11-18 09:32:44,000 color=red)
> >> (2009-11-18 09:32:45,000 color=green)
> >>
> >> grunt> C = FOREACH B GENERATE TOKENIZE((chararray)body) as B1:bag{T1:tuple(T:chararray)};
> >> grunt> describe C;
> >> C: {B1: {T1: (T: chararray)}}
> >>
> >> grunt> D = FOREACH C GENERATE B1.$0 as date;
> >> grunt> describe D;
> >> D: {date: {T: chararray}}
> >>
> >> grunt> dump D;
> >> ...
> >> ({(2009-11-18),(09:32:43),(000),(color=blue)})
> >> ({(2009-11-18),(09:32:43),(000),(color=red)})
> >> ({(2009-11-18),(09:32:44),(000),(color=red)})
> >> ({(2009-11-18),(09:32:45),(000),(color=green)})
> >>
> >> What I'd expect to see is just the date values.
> >>
> >> Any ideas?
> >>
> >> thanks,
> >> Bill
>
> --
> Zaki Rahaman
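
For reference, here is a minimal sketch of the "regexp with capturing groups" idea from the thread. It is written in Python only so the pattern can be tried outside Pig. The pattern itself and the names `LINE_RE` and `parse` are assumptions made for illustration and are not part of piggybank; the sample rows are copied from the `dump B` output above. A regexp-based loader or UDF in Pig uses Java regex syntax and may behave differently, so treat this as a way to check the pattern, not as the Pig solution.

```python
import re

# Sample rows copied from the `dump B` output in the thread.
rows = [
    "2009-11-18 09:32:43,000 color=blue",
    "2009-11-18 09:32:43,000 color=red",
    "2009-11-18 09:32:44,000 color=red",
    "2009-11-18 09:32:45,000 color=green",
]

# Hypothetical pattern: one capturing group per field of interest
# (date, time, milliseconds, color). Each group maps to a fixed
# position, which is what TOKENIZE's unordered bag cannot guarantee.
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}),(\d{3}) color=(\w+)$"
)


def parse(line):
    """Return (date, time, millis, color), or None if the line does not match."""
    match = LINE_RE.match(line)
    return match.groups() if match else None


# Pull just the date field, i.e. capturing group 1.
dates = [parse(row)[0] for row in rows]
print(dates)
```

Because every field comes from a numbered group, the date is always the first element of the result, regardless of how many tokens the rest of the line contains.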
