You are right: by default there is no ordering of tuples within a bag (except in some cases, such as the output of ORDER BY).

If the specific goal is just to pull out the date field, you could use a regexp UDF instead of TOKENIZE to pick out the value you are interested in.

There should be UDFs in Piggybank that do this ...
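
As a rough sketch of the regexp approach, assuming a Pig version that ships the builtin REGEX_EXTRACT (on older versions you would need to register the piggybank jar and use its equivalent UDF instead), something along these lines should pull the leading date out of each line. The alias D and the field name date are just illustrative:

```pig
-- B: {body: chararray}, e.g. (2009-11-18 09:32:43,000 color=blue)
-- REGEX_EXTRACT(string, regex, index) returns the capture group at 'index'.
-- The pattern captures everything up to the first space and then consumes
-- the rest of the line, so it works whether the UDF matches the whole
-- string or only searches for the pattern within it.
D = FOREACH B GENERATE REGEX_EXTRACT(body, '^([^ ]+) .*$', 1) AS date:chararray;
DUMP D;
```

This avoids the bag entirely, so the question of tuple ordering within a bag never comes up.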



Or is this a more general question about accessing tuples within a bag in some ordered fashion?


Regards,
Mridul



Bill Graham wrote:
Hi,

I'm struggling to get the tokens out of a bag of tuples created by the
TOKENIZE UDF and could use some help. I want to tokenize and then be able to
reference the tokens by their position. Is this even possible? Since the
token count is non-deterministic, I'm questioning whether I can use positional
parameters to dig them out.

Anyway, here's what I'm doing, starting with a chararray where each:

grunt> describe B;
B: {body: chararray}
grunt> dump B;
(2009-11-18 09:32:43,000 color=blue)
(2009-11-18 09:32:43,000 color=red)
(2009-11-18 09:32:44,000 color=red)
(2009-11-18 09:32:45,000 color=green)

grunt> C = FOREACH B GENERATE TOKENIZE((chararray)body) as
B1:bag{T1:tuple(T:chararray)};
grunt> describe C;
C: {B1: {T1: (T: chararray)}}

grunt> D = FOREACH C GENERATE B1.$0 as date;
grunt> describe D;
D: {date: {T: chararray}}

grunt> dump D;
...
({(2009-11-18),(09:32:43),(000),(color=blue)})
({(2009-11-18),(09:32:43),(000),(color=red)})
({(2009-11-18),(09:32:44),(000),(color=red)})
({(2009-11-18),(09:32:45),(000),(color=green)})

What I'd expect to see is just the date values.

Any ideas?

thanks,
Bill
