Re: UDF for generating top xx % of results?

Dave Viner Tue, 29 Jun 2010 16:15:06 -0700

Hi Aniket,

Is it possible to pass an alias to a UDF?  In other words, could I do:


TOP = FILTER B BY myCompare(count, V)

or something similar?

I don't quite understand your idea of using STORE.  If I do:

STORE V into 'threshold';
XX = LOAD 'threshold' AS (threshold:long);

How can I filter the count by XX?  Seems like I'm back in the same
situation.

Do you mean doing the STORE(), then directly modifying the files themselves
before doing the LOAD() so that it contains one row for each value of B,
with the threshold value V appended to the end?
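
For readers of the archive: one way to get the "one row for each value of B,
with the threshold appended" shape without editing files by hand is to CROSS
B with the single-row relation V. This is only a sketch and was not proposed
in the thread. It assumes the relations B and U from the script quoted below,
and the names threshold, BV, F and top_queries are made up for illustration.

```pig
-- Sketch only: B is (query, count) and U is the single row (min, max)
-- from the script quoted later in this message.
-- 'threshold' is a hypothetical field name added for illustration.
V = FOREACH U GENERATE (min + (max - min) * 0.95) AS threshold:double;

-- V has exactly one row, so CROSS appends the threshold to every row of B.
BV = CROSS B, V;

-- Keep rows above the threshold, then drop the helper column.
F = FILTER BV BY B::count > V::threshold;
top_queries = FOREACH F GENERATE B::query AS query, B::count AS count;
```

Because V holds a single row, the CROSS does not multiply the size of B; it
only widens each row by one column. It does add an extra join step to the
plan.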

Dave Viner

On Tue, Jun 29, 2010 at 3:57 PM, Aniket Mokashi <[email protected]> wrote:

> Hi Dave,
>
> TOP = FILTER B BY count > V; would not work right now with Pig. An alias can
> be cast to a scalar once https://issues.apache.org/jira/browse/PIG-1434 is
> added. As far as I know, there is no simple way to use values generated by
> MapReduce jobs directly in Pig jobs; you will need a UDF to do that.
> One dirty approach is to store this value in a location with a STORE and
> then later use it to generate a new file with $0 and V and filter with
> $0 > $1.
>
> Thanks,
> Aniket
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of Dave
> Viner
> Sent: Tuesday, June 29, 2010 3:24 PM
> To: [email protected]
> Subject: Re: UDF for generating top xx % of results?
>
> Actually, I've gotten the first half of the code to work now.   Here's how
> it looks:
>
>
> X = LOAD 'samples/excite-small.log' USING PigStorage('\t') AS
> (user:chararray, time:long, query:chararray);
> Z = FILTER X BY query is not null;
> A = GROUP Z BY query;
> B = FOREACH A GENERATE group as query:chararray, COUNT(Z) as count:long;
> C = GROUP B ALL;
> U = FOREACH C GENERATE MIN(B.count) as min:long, MAX(B.count) as max:long;
> V = FOREACH U GENERATE min + (max-min)*0.95;
>
> V is appropriately set... but I can't perform the final step of actually
> filtering the values by that count.
>
> The simple approach:
> TOP = FILTER B BY count > V;
>
> doesn't work... ERROR 1000: Error during parsing. Invalid alias: V in
> {query: chararray,count: long}
>
> Same with SPLIT.
>
> How do I filter or split on the value of V?
>
> Dave Viner
>
> On Tue, Jun 29, 2010 at 3:00 PM, Dave Viner <[email protected]> wrote:
>
> > I don't quite understand this pig latin.  The piggybank
> > function org.apache.pig.piggybank.evaluation.math.MIN takes 2 parameters
> > which are compared.  Here's the sample I'm trying using the tutorial
> > excitelog as a sample.
> >
> > X = LOAD 'samples/excite-small.log' USING PigStorage('\t') AS
> > (user:chararray, time:long, query:chararray);
> > Z = FILTER X BY query is not null;
> > A = GROUP Z BY query;
> > B = FOREACH A GENERATE group as query:chararray, COUNT(Z) as count:long;
> > U = FOREACH B GENERATE *,
> >     MIN(count) as min:long,
> >     MAX(count) as max:long;
> >
> > This doesn't seem to work at all.  It dies with this error:
> > ERROR 1022: Type mismatch merging schema prefix. Field Schema: double.
> > Other Field Schema: min: long
> >
> > Changing the min:long and max:long to doubles (as suggested by the error
> > message), causes this error:
> > ERROR 1045: Could not infer the matching function for
> > org.apache.pig.builtin.MIN as multiple or none of them fit. Please use an
> > explicit cast.
> >
> > What am I missing in using the sample code you've provided?  I can't seem
> > to get it to work...
> >
> > Thanks for your help.
> > Dave Viner
> >
> >
> > On Tue, Jun 29, 2010 at 10:17 AM, hc busy <[email protected]> wrote:
> >
> >> That's what I tried to say in my last email. I don't believe you can
> >> calculate the percentiles exactly in just one pass. Writing out the Pig
> >> for a two-pass algorithm should be easy enough.
> >>
> >> P = group TABLE all;
> >> U = foreach P generate MIN(x) as min, MAX(x) as max;
> >> V = foreach U generate min + (max-min)*0.95;
> >>
> >> would give you the 95th percentile cutoff, and you just filter or split
> >> by V.
> >>
> >>
> >> On Tue, Jun 29, 2010 at 10:03 AM, Dave Viner <[email protected]> wrote:
> >>
> >> > How would I calculate the percentile in one pass?  In order to
> >> > calculate the percentile for each item, I need to know the total count.
> >> > How do I get the total count, and then calculate each item's percentile
> >> > in one pass?
> >> >
> >> > I don't mind doing multiple passes - I am just not sure how to make the
> >> > calculation.
> >> >
> >> > Thanks
> >> > Dave Viner
> >> >
> >> >
> >> > On Tue, Jun 29, 2010 at 9:59 AM, hc busy <[email protected]> wrote:
> >> >
> >> > > I think it's impossible to do this within one M/R. You will want to
> >> > > implement it in two M/R in Pig, because you have to calculate the
> >> > > percentile in pass 1, and then perform the filter in pass 2.
> >> > >
> >> > >
> >> > > On Tue, Jun 29, 2010 at 8:14 AM, Dave Viner <[email protected]> wrote:
> >> > >
> >> > > > Is there a UDF for generating the top X % of results?  For example,
> >> > > > in a log parsing context, it might be the set of search queries that
> >> > > > represent the top 80% of all queries.
> >> > > >
> >> > > > I see in the piggybank that there is a TOP function, but that only
> >> > > > takes the top *number* of results, rather than a percentile.
> >> > > >
> >> > > > Thanks
> >> > > > Dave Viner
> >> > > >
> >> > >
> >> >
> >>
> >
> >
>
>
