Hm, I may be wrong about this, but from what I recall, there are no 'fields' (and no ordering) in the bag of tokens created by TOKENIZE. Each token is its own single-field tuple, so B1.$0 projects the first field of every tuple in the bag, which is why you get all of the tokens back. As such, I don't think there's a way to accomplish what you're trying to do the way it's written.

As an alternative approach, you might try using FLATTEN to unnest the TOKENIZE output, which gives you one tuple per token, and then filter the tokens down to those that match your date pattern. Alternatively, you could accomplish this in one step with a regex extract UDF (there's one in piggybank if I recall correctly, and something similar in Amazon's Pig function jar).

If the data you described below is your input data, then you could remove the projection step altogether by using a RegEx LoadFunc to get the date field.

Hope this helps, and others feel free to correct me if I'm wrong, as I'm sure there's probably a better/more elegant way.
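
In case it's useful, here is a rough, untested sketch of the first two suggestions against the relation B from the message below. The relation names C, D and E, the aliases token and date, and the date regex are my own choices for illustration. The second approach assumes the builtin REGEX_EXTRACT that I believe ships with newer Pig releases; with piggybank you would need to REGISTER the jar and use its regex extract UDF instead.

```pig
-- Approach 1: unnest the bag so each token becomes its own tuple,
-- then keep only the tokens that look like a yyyy-MM-dd date.
-- MATCHES has to match the whole token, so no anchors are needed.
C = FOREACH B GENERATE FLATTEN(TOKENIZE(body)) AS token:chararray;
D = FILTER C BY token MATCHES '[0-9]{4}-[0-9]{2}-[0-9]{2}';

-- Approach 2: skip tokenizing and pull the date out with a regex.
-- Assumes the builtin REGEX_EXTRACT(string, regex, group index).
-- Group 1 is the date; the trailing .* lets the pattern cover the whole line.
E = FOREACH B GENERATE REGEX_EXTRACT(body, '([0-9]{4}-[0-9]{2}-[0-9]{2}).*', 1) AS date;
```

With the sample data below, DUMP D and DUMP E should each give one (2009-11-18) tuple per input line, though I haven't run this.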

-- Zaki

On Wed, Nov 18, 2009 at 3:03 PM, Bill Graham <[email protected]> wrote:
> Hi,
>
> I'm struggling to get the tokens out of a bag of tuples created by the
> TOKENIZE UDF and could use some help. I want to tokenize and then be able to
> reference the tokens by their position. Is this even possible? Since the
> token count is non-deterministic, I'm question whether I can use positional
> parameters to dig them out.
>
> Anyway, here's what I'm doing, starting with a chararray where each:
>
> grunt> describe B;
> B: {body: chararray}
> grunt> dump B;
> (2009-11-18 09:32:43,000 color=blue)
> (2009-11-18 09:32:43,000 color=red)
> (2009-11-18 09:32:44,000 color=red)
> (2009-11-18 09:32:45,000 color=green)
>
> grunt> C = FOREACH B GENERATE TOKENIZE((chararray)body) as
> B1:bag{T1:tuple(T:chararray)};
> grunt> describe C;
> C: {B1: {T1: (T: chararray)}}
>
> grunt> D = FOREACH C GENERATE B1.$0 as date;
> grunt> describe D;
> D: {date: {T: chararray}}
>
> grunt> dump D;
> ...
> ({(2009-11-18),(09:32:43),(000),(color=blue)})
> ({(2009-11-18),(09:32:43),(000),(color=red)})
> ({(2009-11-18),(09:32:44),(000),(color=red)})
> ({(2009-11-18),(09:32:45),(000),(color=green)})
>
> What I'd expect to see is just the date values.
>
> Any ideas?
>
> thanks,
> Bill
>

--
Zaki Rahaman
