Hey Bill,

If you look in piggybank (http://wiki.apache.org/pig/PiggyBank), which lives
in the contrib dir of your pig installation, you'll find several functions
that might help. I haven't used any myself, but in
org.apache.pig.piggybank.storage you'll find RegExLoader and MyRegExLoader.
If you pass a regexp with capturing groups, I believe you can simply use
these functions directly. There are also Apache-log-specific load funcs; I
think there are Common and Combined Log Loaders. Simply set up your scripts
to use those functions to load your input data and I believe you'll have
what you need.
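
For what it's worth, here's a rough, untested sketch of what I have in mind
for your sample lines. The jar path, the input file name, the field names
and the regexp itself are all my assumptions, so adjust them to your setup:

```pig
-- Untested sketch. The jar path and input file name are placeholders.
REGISTER contrib/piggybank/java/piggybank.jar;

-- One capturing group per field: date, time, millis, color.
-- Backslashes are doubled because the pattern is a Pig string literal.
B = LOAD 'input.log'
    USING org.apache.pig.piggybank.storage.MyRegExLoader('(\\S+) (\\S+),(\\d+) color=(\\w+)')
    AS (date, time, millis, color);

-- Each capturing group should come back as its own field,
-- so the date can be referenced by name (or as $0).
D = FOREACH B GENERATE date;
DUMP D;
```

If that behaves the way I expect, D should hold just the date values, with
no TOKENIZE step and no bag to dig through.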



On Wed, Nov 18, 2009 at 4:03 PM, Mridul Muralidharan
<[email protected]> wrote:

>
>
> You are right, there is no ordering of tuples within a bag by default
> (except in some cases - like output of ORDER BY).
>
> For the specific purpose of pulling the date field - you could just use
> some regexp udf instead of tokenize to pick the value you are interested in.
>
> There should be UDFs in piggybank which do this ...
>
>
>
> Or is this a more general question regarding accessing tuples within a bag
> in some ordered fashion ?
>
>
> Regards,
> Mridul
>
>
>
>
> Bill Graham wrote:
>
>> Hi,
>>
>> I'm struggling to get the tokens out of a bag of tuples created by the
>> TOKENIZE UDF and could use some help. I want to tokenize and then be able
>> to reference the tokens by their position. Is this even possible? Since the
>> token count is non-deterministic, I'm questioning whether I can use
>> positional parameters to dig them out.
>>
>> Anyway, here's what I'm doing, starting with a chararray where each:
>>
>> grunt> describe B;
>> B: {body: chararray}
>> grunt> dump B;
>> (2009-11-18 09:32:43,000 color=blue)
>> (2009-11-18 09:32:43,000 color=red)
>> (2009-11-18 09:32:44,000 color=red)
>> (2009-11-18 09:32:45,000 color=green)
>>
>> grunt> C = FOREACH B GENERATE TOKENIZE((chararray)body) as B1:bag{T1:tuple(T:chararray)};
>> grunt> describe C;
>> C: {B1: {T1: (T: chararray)}}
>>
>> grunt> D = FOREACH C GENERATE B1.$0 as date;
>> grunt> describe D;
>> D: {date: {T: chararray}}
>>
>> grunt> dump D;
>> ...
>> ({(2009-11-18),(09:32:43),(000),(color=blue)})
>> ({(2009-11-18),(09:32:43),(000),(color=red)})
>> ({(2009-11-18),(09:32:44),(000),(color=red)})
>> ({(2009-11-18),(09:32:45),(000),(color=green)})
>>
>> What I'd expect to see is just the date values.
>>
>> Any ideas?
>>
>> thanks,
>> Bill
>>
>
>


-- 
Zaki Rahaman
