Thanks Zaki, I think you're right that bag values lack order and can't be accessed by position.
I'll take a look at the regex UDF. What I'm ultimately trying to get is a handle to each token in the body, though; I'm just using the date as an example. I'd like to be able to pull these values out with one UDF execution per line (as opposed to per field). My input is basically access log entries, and I need to get the different space-delimited values out of each one. It seems the thing to do would be to write my own UDF that returns a tuple of the space-delimited tokens for each line passed in. I'm sure this problem has been solved a million times before, though, so if anyone has a better suggestion I'd love to hear it. I recall talk of an access log UDF at one point (maybe it was in Hive), but I can't find any references to it at the moment.

On Wed, Nov 18, 2009 at 12:38 PM, zaki rahaman <[email protected]> wrote:

> Hm,
>
> I may be wrong about this, but from what I recall, there are no 'fields' in
> the bag of tokens (and no ordering) created by TOKENIZE. As such, I don't
> think there's a way to accomplish what you're trying to do the way it's
> written. As an alternative approach, you might try using FLATTEN to unnest
> the TOKENIZE output, giving you a tuple for each token, and then filter the
> tokens to those that match your date pattern. Alternatively, you could
> accomplish this in one step with a regex extract UDF (there's one in
> piggybank if I recall correctly, and something similar in Amazon's Pig
> function jar). If the data you described below is your input data, you
> could remove the projection step altogether by using a regex LoadFunc to
> get the date field. Hope this helps, and others feel free to correct me if
> I'm wrong, as I'm sure there's probably a better/more elegant way.
>
> -- Zaki
>
>
> On Wed, Nov 18, 2009 at 3:03 PM, Bill Graham <[email protected]> wrote:
>
>> Hi,
>>
>> I'm struggling to get the tokens out of a bag of tuples created by the
>> TOKENIZE UDF and could use some help.
>> I want to tokenize and then be able to
>> reference the tokens by their position. Is this even possible? Since the
>> token count is non-deterministic, I'm questioning whether I can use
>> positional parameters to dig them out.
>>
>> Anyway, here's what I'm doing, starting with a relation where each row is
>> a single chararray:
>>
>> grunt> describe B;
>> B: {body: chararray}
>> grunt> dump B;
>> (2009-11-18 09:32:43,000 color=blue)
>> (2009-11-18 09:32:43,000 color=red)
>> (2009-11-18 09:32:44,000 color=red)
>> (2009-11-18 09:32:45,000 color=green)
>>
>> grunt> C = FOREACH B GENERATE TOKENIZE((chararray)body) as
>> B1:bag{T1:tuple(T:chararray)};
>> grunt> describe C;
>> C: {B1: {T1: (T: chararray)}}
>>
>> grunt> D = FOREACH C GENERATE B1.$0 as date;
>> grunt> describe D;
>> D: {date: {T: chararray}}
>>
>> grunt> dump D;
>> ...
>> ({(2009-11-18),(09:32:43),(000),(color=blue)})
>> ({(2009-11-18),(09:32:43),(000),(color=red)})
>> ({(2009-11-18),(09:32:44),(000),(color=red)})
>> ({(2009-11-18),(09:32:45),(000),(color=green)})
>>
>> What I'd expect to see is just the date values.
>>
>> Any ideas?
>>
>> thanks,
>> Bill
>
>
> --
> Zaki Rahaman
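The tuple-returning UDF described above boils down to splitting each log line on whitespace. Here's a minimal sketch of that core logic in plain Java; the Pig EvalFunc wrapper is omitted, and the class and method names are invented for illustration:

```java
// Hypothetical sketch (names invented): the core of a tuple-returning
// tokenizer UDF is just a whitespace split. Note that Pig's built-in
// TOKENIZE also splits on commas, quotes, and parentheses, which is why
// "09:32:43,000" came back as two tokens in the dump above; splitting
// only on whitespace keeps the timestamp together.
import java.util.Arrays;

public class LogTokens {

    // Split one access-log line on runs of whitespace, keeping the tokens
    // in a fixed-order array. Wrapped in a Pig EvalFunc<Tuple>, each token
    // would become a positional field ($0, $1, ...) of the output tuple.
    public static String[] split(String line) {
        return line.trim().split("\\s+");
    }

    public static void main(String[] args) {
        String[] t = split("2009-11-18 09:32:43,000 color=blue");
        System.out.println(Arrays.toString(t));
        // t[0] is the date field: "2009-11-18"
    }
}
```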
