Hm,

I may be wrong about this, but from what I recall, the bag that TOKENIZE
creates has no positional 'fields' and no guaranteed ordering: each token is
its own single-field tuple. B1.$0 therefore projects the first field of every
tuple in the bag, which is every token, so I don't think you can pull out the
date the way it's written. As an alternative, you might try using FLATTEN to
unnest the TOKENIZE output into one tuple per token and then FILTER the
tokens down to those that match your date pattern. Or you could do it in one
step with a regex extract UDF (there's one in piggybank if I recall correctly,
and something similar in Amazon's Pig function jar). If the data you described
below is your raw input, you could skip the projection step altogether by
using a regex LoadFunc to pull out the date field at load time. Hope this
helps, and others feel free to correct me if I'm wrong, as I'm sure there's
probably a more elegant way.
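
Roughly what I have in mind for the first two options, as an untested sketch.
It assumes B is the relation from your message (B: {body: chararray}); the
aliases tokens, dates and D and the field name log_date are just names I made
up. The second part assumes a Pig version that ships REGEX_EXTRACT as a
builtin; on older versions you would have to REGISTER the piggybank jar and
call its regex extract UDF instead, and I haven't checked that its arguments
are identical.

```pig
-- Option 1: unnest the bag so each token is its own row, then keep only
-- the tokens shaped like a date (MATCHES has to match the whole token).
tokens = FOREACH B GENERATE FLATTEN(TOKENIZE(body)) AS token:chararray;
dates  = FILTER tokens BY token MATCHES '[0-9]{4}-[0-9]{2}-[0-9]{2}';

-- Option 2: skip TOKENIZE and capture the leading date straight from body.
-- The trailing .* lets the pattern cover the rest of the line as well.
D = FOREACH B GENERATE
        REGEX_EXTRACT(body, '([0-9]{4}-[0-9]{2}-[0-9]{2}).*', 1) AS log_date;
```

Option 1 loses the link between a token and the line it came from unless you
carry a key along in the FOREACH, so if you need the date next to the other
fields, option 2 is probably the easier one to live with.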

-- Zaki

On Wed, Nov 18, 2009 at 3:03 PM, Bill Graham wrote:

> Hi,
>
> I'm struggling to get the tokens out of a bag of tuples created by the
> TOKENIZE UDF and could use some help. I want to tokenize and then be able
> to reference the tokens by their position. Is this even possible? Since the
> token count is non-deterministic, I'm questioning whether I can use
> positional parameters to dig them out.
>
> Anyway, here's what I'm doing, starting with a chararray where each:
>
> grunt> describe B;
> B: {body: chararray}
> grunt> dump B;
> (2009-11-18 09:32:43,000 color=blue)
> (2009-11-18 09:32:43,000 color=red)
> (2009-11-18 09:32:44,000 color=red)
> (2009-11-18 09:32:45,000 color=green)
>
> grunt> C = FOREACH B GENERATE TOKENIZE((chararray)body) as B1:bag{T1:tuple(T:chararray)};
> grunt> describe C;
> C: {B1: {T1: (T: chararray)}}
>
> grunt> D = FOREACH C GENERATE B1.$0 as date;
> grunt> describe D;
> D: {date: {T: chararray}}
>
> grunt> dump D;
> ...
> ({(2009-11-18),(09:32:43),(000),(color=blue)})
> ({(2009-11-18),(09:32:43),(000),(color=red)})
> ({(2009-11-18),(09:32:44),(000),(color=red)})
> ({(2009-11-18),(09:32:45),(000),(color=green)})
>
> What I'd expect to see is just the date values.
>
> Any ideas?
>
> thanks,
> Bill
>



-- 
Zaki Rahaman
