Item 2 is not currently easy to do in Pig in a parallel fashion. This
is because you don't know how many records each map task is going to
get, so you don't know which number to start with in map 2 and greater.
You could write a complex two-pass algorithm that would first count the
number of tuples and then assign the labels in a second pass, but it
would involve implementing your own Slicer and LoadFunc.
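To make the two-pass idea concrete, here is a minimal sketch in plain Python (not Pig code, and not a real Slicer/LoadFunc): pass one counts the records in each input split, pass two prefix-sums those counts so each task knows the label to start from. The function names and the list-of-lists stand-in for splits are my own invention for illustration.

```python
# Sketch of the two-pass labeling idea. Each sublist stands in for
# the records one map task would receive.

def starting_offsets(splits):
    counts = [len(s) for s in splits]   # pass 1: per-split record counts
    offsets, total = [], 0
    for c in counts:                    # pass 2: cumulative starting positions
        offsets.append(total)
        total += c
    return offsets

def label_splits(splits):
    # each split labels its own records independently, starting
    # from its precomputed offset -- this part parallelizes
    offsets = starting_offsets(splits)
    return [[(offset + i + 1, rec) for i, rec in enumerate(s)]
            for offset, s in zip(offsets, splits)]
```

Only the counting pass needs global coordination; once the offsets are known, every split can label its records without talking to the others.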
If your data set is small enough that you can do the labeling on a
single machine you can do something like:
A = load 'data';
B = group A all parallel 1;
C = foreach B {
        D = order A by sortkey;
        generate flatten(lablerUDF(D));
}
where lablerUDF is a UDF you write that walks through the bag and
appends the position to each tuple. If you are doing this on trunk I
would strongly suggest using the new accumulator interface for this
UDF, as it will make it much more efficient. But again, this depends on
pulling all of your data onto one machine, which defeats the purpose
of parallel systems like Hadoop.
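For illustration, here is the logic lablerUDF would implement, sketched in plain Python rather than as a real Pig EvalFunc (the class and method names below are stand-ins, not Pig's actual API, though the accumulate/getValue shape mirrors Pig's Accumulator interface):

```python
# Simple version: walk the sorted bag and prepend each tuple's
# 1-based position. Requires the whole bag in memory at once.
def labeler(bag):
    return [(i,) + tup for i, tup in enumerate(bag, start=1)]

# Accumulator-style version: the bag arrives in chunks, so the
# position counter is carried across calls and the full bag never
# has to be materialized before processing begins.
class LabelAccumulator:
    def __init__(self):
        self.out, self.pos = [], 0

    def accumulate(self, chunk):
        for tup in chunk:
            self.pos += 1
            self.out.append((self.pos,) + tup)

    def get_value(self):
        return self.out
```

The accumulator shape is what makes the UDF cheaper: Pig can feed it the bag incrementally instead of building the entire bag first.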
Alan.
On Nov 20, 2009, at 12:03 PM, Desai Dharmendra wrote:
I am using Pig and this is what I am trying to do:
1) Sort a relation A into B by a field x. The smallest value of x is
first.
Just use ORDER BY.
2) Label each tuple in B with a number denoting its order in the
sorted relation. So the first tuple would be labeled with a 1, the
second tuple with a 2, the third with a 3, and so on. Not certain how
to do this.
3) Derive a relation C where each row is a bag of tuples. The first
row contains the first n1 tuples from relation B, the second row
contains the tuples from B labeled (n1 + 1) to n2, the third row
contains the tuples from B labeled (n2 + 1) to n3, and so on to n100.
This step is simple (just use FILTER) once we've labeled each tuple in
B with a number.
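As a sketch of step 3 (in plain Python, not Pig): once every tuple carries its label, bucketing reduces to finding which boundary value n1 < n2 < ... each label falls under. The function name and the boundary list are hypothetical.

```python
import bisect

def bucket_by_label(labeled, boundaries):
    # labeled: list of tuples whose first field is the 1-based label
    # boundaries: sorted upper bounds for each row, e.g. [n1, n2, ...]
    rows = [[] for _ in boundaries]
    for tup in labeled:
        # index of the first boundary >= this tuple's label
        idx = bisect.bisect_left(boundaries, tup[0])
        rows[idx].append(tup)
    return rows
```

In Pig this would be a series of FILTER statements (one per row of C), each keeping the labels in its (n_k + 1)..n_(k+1) range.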
The question: how do I do step 2)?
thanks