If you are OK with approximately dividing A into 100 parts in sorted order, you could do

    B = order A by x parallel 100;

That will generate 100 part files, with the data (somewhat) evenly distributed across them. Pig samples the input data first and builds a histogram so that it can spread the data evenly across the reducers.
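For intuition, here is a minimal Python sketch of sample-based range partitioning, the general idea behind the sampling step described above. It is not Pig's actual implementation: the function names (`choose_boundaries`, `assign_partition`, `range_partition`), the sample size, and the equal-frequency cut points are all assumptions made for illustration.

```python
import bisect
import random


def choose_boundaries(sample, num_parts):
    """Pick num_parts - 1 cut points at equal-frequency positions of the sorted sample."""
    ordered = sorted(sample)
    return [ordered[(i * len(ordered)) // num_parts] for i in range(1, num_parts)]


def assign_partition(value, boundaries):
    """Return the index of the range partition that value falls into."""
    return bisect.bisect_right(boundaries, value)


def range_partition(values, num_parts, sample_size=1000, seed=0):
    """Split a non-empty list into num_parts sorted partitions covering increasing ranges."""
    rng = random.Random(seed)
    # Step 1: sample the input and derive range boundaries from the sample.
    sample = rng.sample(values, min(sample_size, len(values)))
    boundaries = choose_boundaries(sample, num_parts)
    # Step 2: route every value to the partition that owns its range.
    parts = [[] for _ in range(num_parts)]
    for v in values:
        parts[assign_partition(v, boundaries)].append(v)
    # Step 3: each partition sorts its own values independently.
    return [sorted(p) for p in parts]
```

Because the boundaries come from a sample rather than from the full data, the partitions are only roughly equal in size, which matches the "(somewhat) evenly" caveat. Reading the partitions back in index order yields the fully sorted data.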
-Thejas

On 11/20/09 12:53 PM, "Alan Gates" <[email protected]> wrote:

> Item 2 is not currently easy to do in Pig in a parallel fashion. This
> is because you don't know how many records each map task is going to
> get, so you don't know which number to start on in map 2 and greater.
> You could write a complex two-pass algorithm that would first count the
> number of tuples and then do the splits again, but it would involve
> implementing your own Slicer and LoadFunc.
>
> If your data set is small enough that you can do the labeling on a
> single machine, you can do something like:
>
> A = load 'data';
> B = group A all parallel 1;
> C = foreach B {
>     D = order A by sortkey;
>     generate flatten(lablerUDF(D));
> }
>
> where lablerUDF is a UDF you write that walks through the bag and
> appends the position to each tuple. If you are doing this on trunk, I
> would strongly suggest using the new accumulator interface for this UDF,
> as it will make it much more efficient. But again, this depends on
> pulling all of your data onto one machine, which defeats the purpose
> of parallel systems like Hadoop.
>
> Alan.
>
> On Nov 20, 2009, at 12:03 PM, Desai Dharmendra wrote:
>
>> I am using Pig and this is what I am trying to do:
>>
>> 1) Sort a relation A into B by a field x. The smallest value of x is
>> first. Just use SORT.
>>
>> 2) Label each tuple in B with a number denoting its order in the
>> sorted relation. So the first tuple would be labeled with a 1, the
>> second tuple with a 2, the third with a 3 and so on. Not certain how
>> to do this.
>>
>> 3) Derive a relation C where each row is a bag of tuples. The first
>> row contains the first n1 tuples from relation B, the second row
>> contains the tuples from B labeled (n1 + 1) to n2, the third row
>> contains the tuples from B labeled (n2 + 1) to n3 and so on to n100.
>> This step is simple (just use filter) once we've labeled each tuple
>> in B with a number.
>>
>> The question: how do I do step 2)?
>>
>> thanks
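
The arithmetic behind the two-pass idea mentioned in the thread (count the tuples first, then number them) can be sketched in plain Python. This only illustrates the offset calculation over partitions that are already in sorted order. It is not Pig code and does not show the Slicer or LoadFunc work that the thread says would be needed. The names `start_offsets` and `label_partitions` are hypothetical.

```python
from itertools import accumulate


def start_offsets(part_sizes):
    """First label of each partition: 1 plus the number of records in all earlier partitions."""
    return [1 + before for before in accumulate([0] + part_sizes[:-1])]


def label_partitions(parts):
    """Attach a global 1-based position to every record in a list of ordered partitions."""
    # Pass 1: count the records in each partition.
    sizes = [len(part) for part in parts]
    offsets = start_offsets(sizes)
    # Pass 2: each partition labels its own records independently,
    # starting from its precomputed offset.
    return [
        [(offset + i, record) for i, record in enumerate(part)]
        for offset, part in zip(offsets, parts)
    ]
```

For example, partitions of sizes 3, 2 and 4 start at labels 1, 4 and 6. Once the offsets are known, the second pass needs no coordination between partitions, which is what makes the approach parallel.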
