Alan Gates
Fri, 20 Nov 2009 12:53:54 -0800
If your data set is small enough that you can do the labeling on a single machine, you can do something like:
A = load 'data';
B = group A all parallel 1;
C = foreach B {
    D = order A by sortkey;
    generate flatten(lablerUDF(D));
}
where lablerUDF is a UDF you write that walks through the bag and
appends the position to each tuple. If you are doing this on trunk, I
would strongly suggest using the new accumulator interface for this UDF,
as it will make it much more efficient. But again, this depends on
pulling all of your data onto one machine, which defeats the purpose
of parallel systems like Hadoop.
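
To make the "walk through the bag and append the position" step concrete, here is a minimal sketch of just that logic in plain Java. It is not a working Pig UDF: the class name PositionLabeler and the use of java.util.List in place of Pig's bag and tuple types are assumptions made for illustration. A real lablerUDF would wrap the same loop in a class extending Pig's EvalFunc (and, on trunk, implementing the accumulator interface).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the body of lablerUDF; it does not use Pig's API.
final class PositionLabeler {

    // Takes tuples that are already sorted (as D is after "order A by sortkey")
    // and returns copies with a 1-based position appended as the last field.
    static List<List<Object>> label(List<List<Object>> sortedTuples) {
        List<List<Object>> labeled = new ArrayList<>(sortedTuples.size());
        long position = 1;
        for (List<Object> tuple : sortedTuples) {
            List<Object> copy = new ArrayList<>(tuple); // leave the input untouched
            copy.add(position++);
            labeled.add(copy);
        }
        return labeled;
    }

    public static void main(String[] args) {
        List<List<Object>> sorted = Arrays.asList(
            Arrays.<Object>asList("a", 10),
            Arrays.<Object>asList("b", 20),
            Arrays.<Object>asList("c", 30));
        for (List<Object> t : label(sorted)) {
            System.out.println(t);
        }
    }
}
```

The sketch assumes the input is already ordered, which is why the Pig script sorts inside the foreach before calling the UDF. The position is kept as a long so that it would not overflow on a large bag.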
Alan.

On Nov 20, 2009, at 12:03 PM, Desai Dharmendra wrote:
I am using PIG and this is what I am trying to do:
1) Sort a relation A into B by a field x. The smallest value of x is first. Just use SORT.
2) Label each tuple in B with a number denoting its order in the sorted relation. So the first tuple would be labeled with a 1, the second tuple with a 2, the third with a 3 and so on. Not certain how to do this.
3) Derive a relation C where each row is a bag of tuples. The first row contains the first n1 tuples from relation B, the second row contains the tuples from B labeled (n1 + 1) to n2, the third row contains the tuples from B labeled (n2 + 1) to n3 and so on to n100. This step is simple (just use filter) once we've labeled each tuple in B with a number.
The question: how do I do step 2)?
thanks