Hi all,

I was thinking about a nice project for Pig for this year's GSoC. One idea I had was to implement something similar to the SQL RANK() function. RANK() returns the rank of each row within the partition of a result set. The rank of a row is one plus the number of ranks that come before the row in question. Basically, when there are no ties, it assigns a consecutive unique identifier to each row (a row id).
In my experience this is a very useful feature. Of course, the naive solution would be to use a single reducer and stamp each tuple in a bag with an increasing id, but there is an algorithm that does this in parallel using 2 MR jobs. The idea would be to add a new operator (as this cannot be done with a UDF) that can rank bags/relations. It could also be used in conjunction with ORDER BY to define the specific rank order.

Do you see this as an interesting project?

Thanks,
-- Gianmarco De Francisci Morales
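
P.S. To make the two-job idea a bit more concrete, here is a rough single-process sketch in Python of one common way such a numbering can be parallelized. It is only an illustration and an assumption, not necessarily the exact algorithm referred to above, and it is not Pig code: the first pass counts the rows in each partition, a prefix sum over those counts gives each partition its starting offset, and the second pass lets every partition number its own rows independently. All function names are made up for the example, and the partitions are assumed to be already in the desired order (e.g. after an ORDER BY).

```python
def count_per_partition(partitions):
    """Pass 1 (stands in for the first MR job): count rows per partition."""
    return [len(part) for part in partitions]


def starting_offsets(counts):
    """Prefix sums: how many rows precede each partition."""
    offsets, total = [], 0
    for count in counts:
        offsets.append(total)
        total += count
    return offsets


def assign_row_ids(partitions):
    """Pass 2 (stands in for the second MR job).

    Each partition only needs its own offset, so in a real job the
    partitions could be numbered in parallel.
    """
    offsets = starting_offsets(count_per_partition(partitions))
    return [
        [(offset + i + 1, row) for i, row in enumerate(part)]
        for offset, part in zip(offsets, partitions)
    ]


# Hypothetical input: three partitions, already globally ordered.
partitions = [["a", "b"], ["c"], ["d", "e", "f"]]
ranked = assign_row_ids(partitions)
```

The only data shared between the two passes is the small list of per-partition counts, which is what would let this avoid funnelling every tuple through a single reducer.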
