If you are OK with approximately dividing A into 100 parts in sorted order,
you could do
B = order A by x parallel 100;
That will generate 100 part files, (somewhat) evenly distributing the data
across them. Pig samples the input data first and generates a histogram to
try to evenly spread the data across reducers.

-Thejas



On 11/20/09 12:53 PM, "Alan Gates" <[email protected]> wrote:

> Item 2 is not currently easy to do in Pig in a parallel fashion.  This
> is because you don't know how many records each map task is going to
> get, so you don't know which number to start with in map 2 and greater.
> You could write a complex two-pass algorithm that first counts the
> number of tuples in each split and then assigns the numbers in a second
> pass, but it would involve implementing your own Slicer and LoadFunc.
> 
> If your data set is small enough that you can do the labeling on a
> single machine you can do something like:
> 
> A = load 'data';
> B = group A all parallel 1;
> C = foreach B {
>         D = order A by sortkey;
>         generate flatten(labelerUDF(D));
> }
> 
> where labelerUDF is a UDF you write that walks through the bag and
> appends the position to each tuple (a rough sketch of such a UDF is
> included after the quoted question below).  If you are doing this on
> trunk I would strongly suggest using the new accumulator interface for
> this UDF, as it will make it much more efficient.  But again, this
> depends on pulling all of your data onto one machine, which defeats the
> purpose of parallel systems like Hadoop.
> 
> Alan.
> 
> On Nov 20, 2009, at 12:03 PM, Desai Dharmendra wrote:
> 
>> I am using Pig and this is what I am trying to do:
>> 
>> 1) Sort a relation A into B by a field x. The smallest value of x is
>> first. Just use ORDER BY.
>> 
>> 2) Label each tuple in B with a number denoting its order in the
>> sorted
>> relation. So the first tuple would be labeled with a 1, the second
>> tuple
>> with a 2, the third with a 3 and so on. Not certain how to do this.
>> 
>> 3) Derive a relation C where each row is a bag of tuples. The first
>> row contains the first n1 tuples from relation B, the second row
>> contains the tuples from B labeled (n1 + 1) to n2, the third row
>> contains the tuples from B labeled (n2 + 1) to n3, and so on up to
>> n100. This step is simple (just use filter) once we've labeled each
>> tuple in B with a number.
>> 
>> The question: how do I do step 2)?
>> 
>> thanks
> 
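
For reference, here is a minimal sketch of what a labelerUDF along the lines
Alan describes could look like, written as a plain Java EvalFunc. The class
name LabelerUDF, the assumption that the sorted bag arrives as the UDF's first
argument, and the choice to append the 1-based position as a trailing long
field are illustrative, not from the original thread; the accumulator
interface Alan mentions would need a different (Accumulator) implementation,
but would avoid materializing the whole bag in memory at once.

import java.io.IOException;
import java.util.Iterator;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

// Sketch only: walks a sorted bag and returns a bag whose tuples carry
// their 1-based position as an extra last field.
public class LabelerUDF extends EvalFunc<DataBag> {
    private final BagFactory bagFactory = BagFactory.getInstance();
    private final TupleFactory tupleFactory = TupleFactory.getInstance();

    @Override
    public DataBag exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        DataBag sorted = (DataBag) input.get(0);   // the ordered inner bag D
        DataBag labeled = bagFactory.newDefaultBag();
        long position = 1;                         // 1-based label
        for (Iterator<Tuple> it = sorted.iterator(); it.hasNext(); position++) {
            Tuple t = it.next();
            // Copy the original fields and append the position as the last field.
            Tuple out = tupleFactory.newTuple(t.size() + 1);
            for (int i = 0; i < t.size(); i++) {
                out.set(i, t.get(i));
            }
            out.set(t.size(), position);
            labeled.add(out);
        }
        return labeled;
    }
}

The jar containing the class would be made visible to the script with
REGISTER, and the labelerUDF name in the foreach would refer to this class
(for example via a DEFINE alias); both of those lines are assumed rather than
shown in the thread.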
