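I don't think this can be done in a single M/R job. In Pig you will need two M/R passes: pass 1 calculates the percentile cutoff over the whole data set, and pass 2 filters the records against that cutoff. The filter cannot run until the cutoff is known, and the cutoff depends on every record.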
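To make the two passes concrete, here is a rough sketch in plain Python rather than Pig. It assumes "top 80%" means the most frequent queries that together account for 80% of all query occurrences, which is only one reading of the question. The function names cutoff_count and top_fraction are made up for illustration and are not Pig or piggybank functions.

```python
from collections import Counter


def cutoff_count(counts, fraction):
    """Pass 1: find the smallest per-query count that still has to be kept
    so that the kept queries cover `fraction` of all occurrences."""
    total = sum(counts.values())
    target = fraction * total
    covered = 0
    # Walk the counts from most to least frequent, accumulating volume.
    for n in sorted(counts.values(), reverse=True):
        covered += n
        if covered >= target:
            return n
    return 0  # only reached for empty input


def top_fraction(queries, fraction=0.8):
    """Pass 2: keep every query whose count meets the pass-1 cutoff."""
    counts = Counter(queries)
    threshold = cutoff_count(counts, fraction)
    return {q: n for q, n in counts.items() if n >= threshold}
```

Note that queries tied at the cutoff count are all kept, so the result can cover somewhat more than the requested fraction. In Pig, each of the two functions would roughly correspond to its own M/R job.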

On Tue, Jun 29, 2010 at 8:14 AM, Dave Viner <[email protected]> wrote:
> Is there a UDF for generating the top X % of results? For example, in a log
> parsing context, it might be the set of search queries that represent the
> top 80% of all queries.
>
> I see in the piggybank that there is a TOP function, but that only takes the
> top *number* of results, rather a percentile.
>
> Thanks
> Dave Viner
