Using group by and foreach you can get tuples like this:

(A, {(1L,1L),(2L,2L),(3L,6L),(5L,1L)})

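By counting the number of tuples in the bag, you can tell whether any
values are missing: fewer than five tuples means at least one value of b
is absent, and since the bag is ordered by b you can see which one.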

Here is the script:

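-- Load the comma-separated tuples (a, b, c).
L = load 'X' using PigStorage(',') as (a:chararray, b:long, c:long);
G = group L by a;
-- For each key, emit a bag of (b, c) pairs sorted by b.
F = foreach G { O = order L by b; generate group, O.(b, c); }
dump F;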
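
If you want the 0 filled in directly, one option is to pivot the counts
into fixed columns before grouping. This is only a sketch: it assumes b
is always in the range 1-5 as described in the question, and the aliases
P and T and the field names c1..c5 are made up for illustration.

```pig
L = load 'X' using PigStorage(',') as (a:chararray, b:long, c:long);
-- One column per possible value of b: c goes in the matching column, 0 elsewhere.
P = foreach L generate a,
        (b == 1L ? c : 0L) as c1,
        (b == 2L ? c : 0L) as c2,
        (b == 3L ? c : 0L) as c3,
        (b == 4L ? c : 0L) as c4,
        (b == 5L ? c : 0L) as c5;
G = group P by a;
-- Summing each column collapses the group into one fixed-position tuple.
T = foreach G generate group, SUM(P.c1), SUM(P.c2), SUM(P.c3), SUM(P.c4), SUM(P.c5);
dump T;
```

For the sample data this should give the combined tuple from the
question, with a 0 in the fourth position showing that no words occurred
4 times.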

Thanks
-Richard

-----Original Message-----
From: Greg Langmead
Sent: Wednesday, May 05, 2010 2:15 PM
To: [email protected]
Subject: Re: Help identifying missing value

My examples should have had A as the first field, not $-NT or $NT, in
both the combined tuple and the map:

(A, 1L, 2L, 6L, 0L, 1L)

(A, 1L#1L, 2L#2L, 3L#6L, 5L#1L)

On May 5, 2010, at 5:06 PM, Greg Langmead wrote:

> At an intermediate point in my processing, I have these tuples:
> 
> DUMP X;
> (A,1L,1L)
> (A,2L,2L)
> (A,3L,6L)
> (A,5L,1L)
> 
> The middle element of these tuples can have any integer value from
> 1-5, and the third element can have any positive integer value. (These
> data points mean, for example for the third tuple, "I saw 6 distinct
> words that started with the letter A that occurred 3 times each.") My
> problem is that to do the math I need to do next, I need to know that
> there were 0 words that occurred 4 times, so I need to group these four
> tuples into one record that permits me to ask "what is the value that
> goes with 1, ... what is the value that goes with 5".
> 
> I could stream these through a script and do what I want, but I'm new
> to Pig and I'd like to explore what can be done strictly within Pig.
> 
> Maybe I could gather these into a tuple, but with a 0 at the position
> for 4:
> 
> ($-NT,1L,2L,6L,0L,1L)
> 
> or else somehow generate a map from this:
> 
> ($NT, 1L#1L, 2L#2L, 3L#6L, 5L#1L)
> 
> which would also alert me to the absence of 4L. Can I do either of
> these things?
> 
> Thanks,
> Greg Langmead
> Research Scientist
> Language Weaver, Inc.
