Re: UDF for generating top xx % of results?

Dave Viner Tue, 29 Jun 2010 15:01:20 -0700

I don't quite understand this pig latin.  The piggybank
function org.apache.pig.piggybank.evaluation.math.MIN takes 2 parameters
which are compared.  Here's the sample I'm trying using the tutorial
excitelog as a sample.


X = LOAD 'samples/excite-small.log' USING PigStorage('\t') AS
(user:chararray, time:long, query:chararray);
Z = FILTER X BY query is not null;
A = GROUP Z BY query;
B = FOREACH A GENERATE group as query:chararray, COUNT(Z) as count:long;
U = FOREACH B GENERATE *,
    MIN(count) as min:long,
    MAX(count) as max:long;

This doesn't seem to work at all.  It dies with this error:
ERROR 1022: Type mismatch merging schema prefix. Field Schema: double. Other
Field Schema: min: long

Changing the min:long and max:long to doubles (as suggested by the error
message), causes this error:
ERROR 1045: Could not infer the matching function for
org.apache.pig.builtin.MIN as multiple or none of them fit. Please use an
explicit cast.

What am I missing in using the sample code you've provided?  I can't seem to
get it to work...

Thanks for your help.
Dave Viner


On Tue, Jun 29, 2010 at 10:17 AM, hc busy <[email protected]> wrote:

> That's what I tried to say in my last email.I don't believe you can
> calculate exactly the percentiles in just one pass. Writing out the pig for
> two pass algorithm should be easy enough..
>
> P = group TABLE all;
> U = foreach P generate MIN(x) as min, MAX(x) as max;
> V = foreach U generate min + (max-min)*0.95;
>
> would give you the 95th percentile cutoff, and u just filter or split by V.
>
>
> On Tue, Jun 29, 2010 at 10:03 AM, Dave Viner <[email protected]> wrote:
>
> > How would I calculate the percentile in one pass?  In order to calculate
> > the
> > percentile for each item, I need to know the total count.  How do I get
> the
> > total count, and then calculate each item's percentile in one pass?
> >
> > I don't mind doing multiple passes - I am just not sure how to make the
> > calculation.
> >
> > Thanks
> > Dave Viner
> >
> >
> > On Tue, Jun 29, 2010 at 9:59 AM, hc busy <[email protected]> wrote:
> >
> > > I think it's impossible to do this within one M/R. You will want to
> > > implement it in two M/R in Pig, because you have to calculate the
> > > percentile
> > > in pass 1, and then perform the filter in pass 2.
> > >
> > >
> > > On Tue, Jun 29, 2010 at 8:14 AM, Dave Viner <[email protected]>
> wrote:
> > >
> > > > Is there a UDF for generating the top X % of results?  For example,
> in
> > a
> > > > log
> > > > parsing context, it might be the set of search queries that represent
> > the
> > > > top 80% of all queries.
> > > >
> > > > I see in the piggybank that there is a TOP function, but that only
> > takes
> > > > the
> > > > top *number* of results, rather a percentile.
> > > >
> > > > Thanks
> > > > Dave Viner
> > > >
> > >
> >
>

Re: UDF for generating top xx % of results?

Reply via email to