RANK implementation in Pig

Gianmarco Tue, 22 Mar 2011 11:37:20 -0700

Hi all,

I was thinking about a nice project for Pig for this year's GSoC.
One idea I had was to implement something similar to the SQL RANK() function.
RANK() returns the rank of each row within the partition of a result set.
The rank of a row is one plus the number of ranks that come before the row
in question.
Basically, when there are no ties, it assigns a consecutive unique identifier
to each row (a row id).
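
To illustrate the definition above, here is a small Python sketch of the
ranking rule on a single, unpartitioned list. The function name sql_rank is
hypothetical and only used for this example; it follows the usual SQL behavior
in which tied rows share a rank and the next distinct value skips ahead.

```python
def sql_rank(values):
    """Return (value, rank) pairs in sorted order.

    rank = 1 + number of rows that sort strictly before the row,
    so tied values share a rank and leave a gap after them.
    """
    ordered = sorted(values)
    ranked = []
    for i, value in enumerate(ordered):
        if i > 0 and value == ordered[i - 1]:
            # Tie with the previous row: reuse its rank.
            rank = ranked[-1][1]
        else:
            # i rows come before this one, so its rank is i + 1.
            rank = i + 1
        ranked.append((value, rank))
    return ranked


print(sql_rank([30, 10, 20, 10]))
```

With distinct values the ranks are simply 1, 2, 3, ..., which is the row-id
case.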


In my experience this is a very useful feature.
Of course, the naive solution would be to use a single reducer and stamp each
tuple in a bag with an increasing id.
But there is an algorithm that does this in parallel, using two MapReduce jobs.
The idea would be to add a new operator (as this cannot be done with a UDF)
that can rank bags/relations.
Of course this could be used in conjunction with ORDER BY to define the
specific rank order.
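
To make the two-job idea more concrete, here is a rough Python sketch of one
way such a scheme could work. It is an assumption for illustration, not
necessarily the exact algorithm meant above: it assumes the data is already
split into partitions that are globally ordered (for example the output of
ORDER BY), and all function names are hypothetical. The first pass counts the
records in each partition, and the second pass numbers the records of each
partition independently, starting from the sum of the preceding counts.

```python
from itertools import accumulate


def count_partitions(partitions):
    # Pass 1: each task counts the records in its own partition.
    # The counts are independent, so this step can run in parallel.
    return [len(part) for part in partitions]


def partition_offsets(counts):
    # Exclusive prefix sum: the offset of partition i is the total
    # number of records in partitions 0 .. i-1.
    return [0] + list(accumulate(counts))[:-1]


def assign_row_ids(partitions):
    counts = count_partitions(partitions)
    offsets = partition_offsets(counts)
    # Pass 2: each task numbers its own records starting after its
    # offset, without needing to see any other partition.
    return [
        [(offset + i + 1, record) for i, record in enumerate(part)]
        for part, offset in zip(partitions, offsets)
    ]


print(assign_row_ids([["a", "b"], ["c"], ["d", "e", "f"]]))
```

Only the small list of per-partition counts has to be gathered in one place,
so no single reducer ever sees all the tuples.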

Do you see this as an interesting project?

Thanks,
--
Gianmarco De Francisci Morales
