Hey Bill,

If you look in piggybank (http://wiki.apache.org/pig/PiggyBank), which lives in the contrib dir of your Pig installation, you'll find several functions that might help. I haven't used any of them myself, but in org.apache.pig.piggybank.storage you'll find RegExLoader and MyRegExLoader. If you pass a regular expression with capturing groups, I believe you can use these load functions directly.

There are also Apache-log-specific load funcs; I think there are Common and Combined Log Loaders. Set up your scripts to use those functions to load your input data and I believe you'll have what you need.
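Something along these lines is what I have in mind for your sample data. Treat it as an untested sketch: the jar path and the input file name are placeholders I made up, the pattern is just my guess at your line format, and I'm assuming MyRegExLoader takes the pattern as its single constructor argument and emits one field per capturing group.

```pig
-- Assumption: piggybank.jar has been built and lives at this path; adjust to your install.
REGISTER contrib/piggybank/java/piggybank.jar;

-- 'input.log' is a hypothetical file holding lines such as:
--   2009-11-18 09:32:43,000 color=blue
-- The three capturing groups are meant to pick out the date, the time and the color.
B = LOAD 'input.log'
    USING org.apache.pig.piggybank.storage.MyRegExLoader('([0-9-]+) ([0-9:,]+) color=([a-z]+)');

-- Fields are then addressed by position, so the date should be $0.
D = FOREACH B GENERATE $0 AS date;
DUMP D;
```

Since each line becomes a flat tuple rather than a bag of tokens, positional references like $0 should behave the way you expected them to with TOKENIZE.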
On Wed, Nov 18, 2009 at 4:03 PM, Mridul Muralidharan <[email protected]> wrote:

> You are right, there is no ordering of tuples within a bag by default
> (except in some cases - like output of ORDER BY).
>
> For the specific purpose of pulling the date field - you could just use
> some regexp udf instead of tokenize to pick the value you are interested in.
>
> There should be udf's in piggy bank which do this ...
>
> Or is this a more general question regarding accessing tuples within a bag
> in some ordered fashion ?
>
> Regards,
> Mridul
>
> Bill Graham wrote:
>
>> Hi,
>>
>> I'm struggling to get the tokens out of a bag of tuples created by the
>> TOKENIZE UDF and could use some help. I want to tokenize and then be able
>> to reference the tokens by their position. Is this even possible? Since the
>> token count is non-deterministic, I'm question whether I can use
>> positional parameters to dig them out.
>>
>> Anyway, here's what I'm doing, starting with a chararray where each:
>>
>> grunt> describe B;
>> B: {body: chararray}
>> grunt> dump B;
>> (2009-11-18 09:32:43,000 color=blue)
>> (2009-11-18 09:32:43,000 color=red)
>> (2009-11-18 09:32:44,000 color=red)
>> (2009-11-18 09:32:45,000 color=green)
>>
>> grunt> C = FOREACH B GENERATE TOKENIZE((chararray)body) as B1:bag{T1:tuple(T:chararray)};
>> grunt> describe C;
>> C: {B1: {T1: (T: chararray)}}
>>
>> grunt> D = FOREACH C GENERATE B1.$0 as date;
>> grunt> describe D;
>> D: {date: {T: chararray}}
>>
>> grunt> dump D;
>> ...
>> ({(2009-11-18),(09:32:43),(000),(color=blue)})
>> ({(2009-11-18),(09:32:43),(000),(color=red)})
>> ({(2009-11-18),(09:32:44),(000),(color=red)})
>> ({(2009-11-18),(09:32:45),(000),(color=green)})
>>
>> What I'd expect to see is just the date values.
>>
>> Any ideas?
>>
>> thanks,
>> Bill

--
Zaki Rahaman
