That's what I tried to say in my last email. I don't believe you can calculate the percentiles exactly in just one pass. Writing out the Pig for a two-pass algorithm should be easy enough:
P = group TABLE all;
U = foreach P generate MIN(TABLE.x) as min, MAX(TABLE.x) as max;
V = foreach U generate min + (max-min)*0.95;

would give you a cutoff 95% of the way from min to max (which matches the 95th percentile only if x is spread roughly evenly over that range), and you just filter or split by V.

On Tue, Jun 29, 2010 at 10:03 AM, Dave Viner <[email protected]> wrote:

> How would I calculate the percentile in one pass? In order to calculate the
> percentile for each item, I need to know the total count. How do I get the
> total count, and then calculate each item's percentile in one pass?
>
> I don't mind doing multiple passes - I am just not sure how to make the
> calculation.
>
> Thanks
> Dave Viner
>
> On Tue, Jun 29, 2010 at 9:59 AM, hc busy <[email protected]> wrote:
>
> > I think it's impossible to do this within one M/R. You will want to
> > implement it in two M/R in Pig, because you have to calculate the
> > percentile in pass 1, and then perform the filter in pass 2.
> >
> > On Tue, Jun 29, 2010 at 8:14 AM, Dave Viner <[email protected]> wrote:
> >
> > > Is there a UDF for generating the top X % of results? For example, in a
> > > log parsing context, it might be the set of search queries that represent
> > > the top 80% of all queries.
> > >
> > > I see in the piggybank that there is a TOP function, but that only takes
> > > the top *number* of results, rather than a percentile.
> > >
> > > Thanks
> > > Dave Viner
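
P.S. To make the two-pass idea concrete, here is a rough sketch in plain Python rather than Pig. It is only an illustration: the function names are made up for this example, and the "nearest-rank" percentile definition is an assumption on my part (other definitions interpolate between values). It contrasts a cutoff taken from the min/max range with a cutoff taken from the ranked values, which can differ a lot on skewed data such as query counts.

```python
import math


def range_cutoff(values, pct):
    """Cutoff pct% of the way from min to max (needs only MIN and MAX)."""
    lo, hi = min(values), max(values)
    return lo + (hi - lo) * pct / 100


def percentile_cutoff(values, pct):
    """Nearest-rank percentile: needs the total count and a sorted view."""
    n = len(values)                          # pass 1: total count
    rank = max(1, math.ceil(pct * n / 100))  # 1-based rank of the cutoff
    return sorted(values)[rank - 1]


def at_or_above(values, cutoff):
    """Pass 2: keep only the items at or above the cutoff."""
    return [v for v in values if v >= cutoff]


# Hypothetical, heavily skewed data: one value dwarfs the rest.
counts = [1, 1, 2, 2, 3, 3, 4, 5, 8, 1000]

by_range = at_or_above(counts, range_cutoff(counts, 80))
by_rank = at_or_above(counts, percentile_cutoff(counts, 80))
print(by_range)
print(by_rank)
```

On this sample the range-based cutoff keeps only the single outlier, while the rank-based cutoff keeps the three largest values, so which one you want depends on whether "top 80%" refers to the value range or to the ranking of the items.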
