OK, I think I see how it's working. As a side note to this conversation, the wiki at http://wiki.apache.org/pig/PigStreamingFunctionalSpec indicates that what I was looking for *should* be possible with this code:
A = load 'data1';
B = group A all;
C = foreach B generate COUNT(B);
store C into 'count';
D = load 'data2';
E = foreach D generate $1/GetScalar(C);

However, GetScalar is no longer present in 0.7.0.

Thanks for your help.
Dave Viner

On Tue, Jun 29, 2010 at 5:18 PM, Thejas Nair <[email protected]> wrote:

> On 6/29/10 4:22 PM, "Dave Viner" <[email protected]> wrote:
>
> > Hi Thejas,
> >
> > Thanks for this suggestion. Just to make sure I understand the strategy here:
> >
> > Pig query 1 outputs 2 files (or directories with files in them): "file.ordered", which contains the data set (query, count) in ascending order of count; and "total_rows", which contains 1 line with the total number of rows.
> >
> > Then, I need a small script that reads the value of "total_rows", multiplies that value by the desired percentile (0.95 in the example), and invokes pig a second time, passing the "TOP_ROWS" value as a define on the command line. The second pig query outputs the desired file in "output", which contains only those rows which have a count above the given threshold.
>
> Yes, you are right. But "output" contains 0.95 * total_num_rows rows; it is not all rows that have a count above a given threshold. So it is possible that a few rows that are not in "output" have the same count value as the last result in "output". For example, if the input has 100 rows with 1 as the count value, you will get an arbitrary 95 of those rows in the output.
>
> -Thejas
>
> > Is that correct?
> >
> > Thanks
> > Dave Viner
> >
> > On Tue, Jun 29, 2010 at 4:11 PM, Thejas Nair <[email protected]> wrote:
> >
> >> This pig query will not give you the "top xx% of results". I.e., if xx = 50% and results = 200, you may not get 100 rows in output.
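The tie behavior Thejas describes can be sketched in plain Python (this example is not part of the thread; the 100-rows-of-count-1 data is invented to match his description):

```python
# 100 queries that all have count 1, as in Thejas's example.
rows = [("q%d" % i, 1) for i in range(100)]

# LIMIT-style selection: keep 0.95 * total_rows rows. Ties are broken
# arbitrarily, so which 5 rows fall out is not determined by the data.
top_n = len(rows) * 95 // 100
limited = sorted(rows, key=lambda r: r[1])[:top_n]

print(len(limited))   # 95 rows kept; the 5 excluded rows have the same
                      # count (1) as every row that was kept
```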
> >> If you really want "top xx% of results", you need to do something like the following -
> >> (if you just want rows that have count < (max(count) - min(count))*0.95, you can store and use the value of V the way TOTAL_ROWS is stored in this example)
> >>
> >> Pig query 1
> >> -----------
> >> L = load 'file' as (query, count); -- assuming that count is already there, for simplicity
> >> O = order L by count; -- order it by the column that defines 'top'
> >> G = group L all;
> >> TOTAL_ROWS = foreach G generate COUNT(L);
> >> store O into 'file.ordered';
> >>
> >> -- at present, there is no way to use 'TOTAL_ROWS' directly to filter the results
> >> store TOTAL_ROWS into 'total_rows.txt';
> >>
> >> Pig query 2
> >> -----------
> >> -- assuming this query is invoked with -param TOP_ROWS=X*0.95, where X is the result stored in total_rows.txt
> >> L = load 'file.ordered';
> >> LIM = limit L $TOP_ROWS;
> >> store LIM into 'output';
> >>
> >> On 6/29/10 3:24 PM, "Dave Viner" <[email protected]> wrote:
> >>
> >>> Actually, I've gotten the first half of the code to work now. Here's how it looks:
> >>>
> >>> X = LOAD 'samples/excite-small.log' USING PigStorage('\t') AS (user:chararray, time:long, query:chararray);
> >>> Z = FILTER X BY query is not null;
> >>> A = GROUP Z BY query;
> >>> B = FOREACH A GENERATE group AS query:chararray, COUNT(Z) AS count:long;
> >>> C = GROUP B ALL;
> >>> U = FOREACH C GENERATE MIN(B.count) AS min:long, MAX(B.count) AS max:long;
> >>> V = FOREACH U GENERATE min + (max-min)*0.95;
> >>>
> >>> V is appropriately set... but I can't perform the final step of actually filtering the values by that count.
> >>>
> >>> The simple approach:
> >>>
> >>> TOP = FILTER B BY count > V;
> >>>
> >>> doesn't work:
> >>>
> >>> ERROR 1000: Error during parsing. Invalid alias: V in {query: chararray,count: long}
> >>>
> >>> Same with SPLIT.
> >>>
> >>> How do I filter or split on the value of V?
> >>>
> >>> Dave Viner
> >>>
> >>> On Tue, Jun 29, 2010 at 3:00 PM, Dave Viner <[email protected]> wrote:
> >>>
> >>>> I don't quite understand this pig latin. The piggybank function org.apache.pig.piggybank.evaluation.math.MIN takes 2 parameters, which are compared. Here's the sample I'm trying, using the tutorial excite log as a sample.
> >>>>
> >>>> X = LOAD 'samples/excite-small.log' USING PigStorage('\t') AS (user:chararray, time:long, query:chararray);
> >>>> Z = FILTER X BY query is not null;
> >>>> A = GROUP Z BY query;
> >>>> B = FOREACH A GENERATE group AS query:chararray, COUNT(Z) AS count:long;
> >>>> U = FOREACH B GENERATE *,
> >>>>     MIN(count) AS min:long,
> >>>>     MAX(count) AS max:long;
> >>>>
> >>>> This doesn't seem to work at all. It dies with this error:
> >>>>
> >>>> ERROR 1022: Type mismatch merging schema prefix. Field Schema: double. Other Field Schema: min: long
> >>>>
> >>>> Changing min:long and max:long to doubles (as suggested by the error message) causes this error:
> >>>>
> >>>> ERROR 1045: Could not infer the matching function for org.apache.pig.builtin.MIN as multiple or none of them fit. Please use an explicit cast.
> >>>>
> >>>> What am I missing in using the sample code you've provided? I can't seem to get it to work...
> >>>>
> >>>> Thanks for your help.
> >>>> Dave Viner
> >>>>
> >>>> On Tue, Jun 29, 2010 at 10:17 AM, hc busy <[email protected]> wrote:
> >>>>
> >>>>> That's what I tried to say in my last email. I don't believe you can calculate the percentiles exactly in just one pass. Writing out the Pig for the two-pass algorithm should be easy enough..
> >>>>>
> >>>>> P = group TABLE all;
> >>>>> U = foreach P generate MIN(TABLE.x) as min, MAX(TABLE.x) as max;
> >>>>> V = foreach U generate min + (max-min)*0.95;
> >>>>>
> >>>>> would give you the 95th percentile cutoff, and you just filter or split by V.
> >>>>>
> >>>>> On Tue, Jun 29, 2010 at 10:03 AM, Dave Viner <[email protected]> wrote:
> >>>>>
> >>>>>> How would I calculate the percentile in one pass? In order to calculate the percentile for each item, I need to know the total count. How do I get the total count, and then calculate each item's percentile, in one pass?
> >>>>>>
> >>>>>> I don't mind doing multiple passes - I am just not sure how to make the calculation.
> >>>>>>
> >>>>>> Thanks
> >>>>>> Dave Viner
> >>>>>>
> >>>>>> On Tue, Jun 29, 2010 at 9:59 AM, hc busy <[email protected]> wrote:
> >>>>>>
> >>>>>>> I think it's impossible to do this within one M/R pass. You will want to implement it as two M/R jobs in Pig, because you have to calculate the percentile cutoff in pass 1 and then perform the filter in pass 2.
> >>>>>>>
> >>>>>>> On Tue, Jun 29, 2010 at 8:14 AM, Dave Viner <[email protected]> wrote:
> >>>>>>>
> >>>>>>>> Is there a UDF for generating the top X% of results? For example, in a log-parsing context, it might be the set of search queries that represent the top 80% of all queries.
> >>>>>>>>
> >>>>>>>> I see in the piggybank that there is a TOP function, but that only takes the top *number* of results, rather than a percentile.
> >>>>>>>>
> >>>>>>>> Thanks
> >>>>>>>> Dave Viner
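hc busy's two-pass recipe above can be sketched end to end in plain Python (illustrative only; the query counts are invented, and note that min + (max-min)*0.95 is a cutoff over the range of values, not a rank-based percentile):

```python
# Pass 1: compute the cutoff V = min + (max - min) * 0.95 over the counts.
counts = {"query_a": 3, "query_b": 40, "query_c": 97,
          "query_d": 99, "query_e": 100}

lo, hi = min(counts.values()), max(counts.values())
cutoff = lo + (hi - lo) * 0.95        # 3 + 97 * 0.95 = 95.15

# Pass 2: filter (or split) by the cutoff, as in 'TOP = FILTER B BY count > V'.
top = {q: c for q, c in counts.items() if c > cutoff}

print(sorted(top))                    # ['query_c', 'query_d', 'query_e']
```

Because this keeps every row above a value cutoff, it can return more or fewer rows than 5% of the input, which is the distinction Thejas draws at the top of the thread.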

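For contrast, the original question at the bottom of the thread ("queries that represent the top 80% of all queries") reads most naturally as a cumulative-share cutoff rather than a value cutoff. A plain-Python sketch, with invented counts:

```python
# Take queries in descending count order until they cover 80% of total volume.
counts = {"a": 50, "b": 30, "c": 10, "d": 6, "e": 4}
total = sum(counts.values())          # 100 events in all
threshold = total * 80 // 100         # stop once 80% of volume is covered

picked, running = [], 0
for q, c in sorted(counts.items(), key=lambda kv: -kv[1]):
    if running >= threshold:
        break
    picked.append(q)
    running += c

print(picked, running)                # ['a', 'b'] 80
```

This gives a different answer from both the LIMIT-based and the range-cutoff approaches discussed in the thread, which is much of what the discussion is circling around.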