I don't think this can be done within a single M/R job. In Pig you will
want two M/R passes: pass 1 calculates the percentile cutoff, and pass 2
filters the results against that cutoff.
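
To make the two passes concrete, here is a rough sketch in plain Python (not Pig, and not a UDF). It assumes one reading of "top 80%": the most frequent queries whose counts together cover 80% of all query occurrences. The function name `top_share` and the `share` parameter are made up for illustration.

```python
from collections import Counter


def top_share(queries, share=0.8):
    """Return the most frequent queries that together cover `share`
    of all query occurrences (hypothetical helper, for illustration)."""
    counts = Counter(queries)

    # Pass 1: compute the global total, which fixes the cutoff.
    total = sum(counts.values())
    cutoff = share * total

    # Pass 2: walk the queries in descending frequency and keep them
    # until the running count reaches the cutoff.
    kept = []
    running = 0
    for query, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if running >= cutoff:
            break
        kept.append(query)
        running += n
    return kept


if __name__ == "__main__":
    log = ["a"] * 5 + ["b"] * 3 + ["c"] + ["d"]
    print(top_share(log, 0.8))
```

The second pass cannot start until the first has finished, because the cutoff depends on the total over the whole input. That dependency is why a single M/R job does not seem to be enough.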


On Tue, Jun 29, 2010 at 8:14 AM, Dave Viner <[email protected]> wrote:

> Is there a UDF for generating the top X% of results?  For example, in a
> log parsing context, it might be the set of search queries that represent
> the top 80% of all queries.
>
> I see in the piggybank that there is a TOP function, but that only takes
> the top *number* of results, rather than a percentile.
>
> Thanks
> Dave Viner
>
