Ahh, I see, I seem to have misread the question. If it's the top xx% of entries, then certainly sorting and then limiting is the way to go.
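That sort-then-limit idea can be sketched like this (Python used purely for illustration; in Pig it would be an ORDER followed by a LIMIT, and the descending order here is just one choice of what "top" means):

```python
# Illustrative only: "top xx% of entries" as sort-then-limit.
# In Pig this corresponds to ORDER ... BY count; LIMIT ... N,
# where N = total_rows * xx / 100.

def top_percent(rows, pct):
    """Return the top pct% of (query, count) rows, highest counts first."""
    ordered = sorted(rows, key=lambda r: r[1], reverse=True)
    n = int(len(ordered) * pct / 100)
    return ordered[:n]

rows = [("a", 10), ("b", 40), ("c", 30), ("d", 20)]
top_percent(rows, 50)  # the two highest-count rows: [("b", 40), ("c", 30)]
```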
@Thejas I had thought that LIMIT is distributed and does not guarantee you get the results in order?

@Dave, here's what to do after you get the min + (max-min)*0.95 into V, which now has just one row:

P = group TABLE all;
U = foreach P generate MIN(x) as min, MAX(x) as max;
V = foreach U generate min + (max-min)*0.95 as v;
D = cross TABLE, V;
E = filter D by D.x <= V.v;

We should really start a wiki page to store these common calculations in Pig... I'd imagine a lot of people out there have needed to do this at one time or another. This is also why, imnsho, I'd like to see cross as a nested_op inside foreach, because of this type of application.

On Tue, Jun 29, 2010 at 6:42 PM, Dave Viner <[email protected]> wrote:

> Ok. I think I see how it's working.
>
> As a side note to this conversation, the wiki at
> http://wiki.apache.org/pig/PigStreamingFunctionalSpec indicates that what I
> was looking for *should* be possible with this code:
>
> A = load 'data1';
> B = group A all;
> C = foreach B generate COUNT(B);
> store C into 'count';
> D = load 'data2';
> E = foreach D generate $1/GetScalar(C);
>
> However, GetScalar is no longer present in 0.7.0.
>
> Thanks for your help.
>
> Dave Viner
>
> On Tue, Jun 29, 2010 at 5:18 PM, Thejas Nair <[email protected]> wrote:
>
> > On 6/29/10 4:22 PM, "Dave Viner" <[email protected]> wrote:
> >
> > > Hi Thejas,
> > >
> > > Thanks for this suggestion. Just to make sure I understand the
> > > strategy here:
> > >
> > > Pig query 1 outputs 2 files (or directories w/ files in them):
> > > "file.ordered", which contains the data set (query, count) in
> > > ascending order of count; and "total_rows", which contains 1 line
> > > with the total number of rows.
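For the two-query flow being summarized here, the glue between the two Pig invocations can be a small driver script. This is a hypothetical sketch: the part-file path, the 0.95 default, and the script name "query2.pig" are all assumptions, not from the thread.

```python
# Hypothetical driver: read the row count stored by Pig query 1,
# compute TOP_ROWS, and invoke Pig query 2 with it as a -param.
import subprocess

def compute_top_rows(total_rows, fraction=0.95):
    """TOP_ROWS = total number of rows * desired fraction."""
    return int(total_rows * fraction)

def run_query2(total_rows_file, pig_script="query2.pig", fraction=0.95):
    with open(total_rows_file) as f:
        total = int(f.read().strip())
    top_rows = compute_top_rows(total, fraction)
    # Equivalent to: pig -param TOP_ROWS=<n> query2.pig
    subprocess.check_call(
        ["pig", "-param", "TOP_ROWS=%d" % top_rows, pig_script])
```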
> > > Then, I need a small script that reads the value of "total_rows",
> > > multiplies that value by the desired percentile (0.95 in the
> > > example), and invokes pig a second time, passing the "TOP_ROWS"
> > > value as a define on the command line. The second pig query outputs
> > > the desired file in "output", which contains only those rows which
> > > have a count above the given threshold.
> >
> > Yes, you are right. But the "output" contains 0.95 * total_num_rows
> > rows; it is not all rows that have a count above a given threshold.
> > So it is possible that a few rows that are not in the output have the
> > same count as the last result in "output". For example, if the input
> > has 100 rows with 1 as the count value, you will get any 95 of those
> > rows in the output.
> >
> > -Thejas
> >
> > > Is that correct?
> > >
> > > Thanks
> > > Dave Viner
> > >
> > > On Tue, Jun 29, 2010 at 4:11 PM, Thejas Nair <[email protected]> wrote:
> > >
> > >> This pig query will not give you the "top xx% of results". I.e., if
> > >> xx = 50% and results = 200, you may not get 100 rows in the output.
> > >> If you really want the "top xx% of results", you need to do
> > >> something like the following.
> > >> (If you just want rows that have count < min(count) + (max(count) -
> > >> min(count))*0.95, you can store and use the value of V the way
> > >> TOTAL_ROWS is stored in this example.)
> > >>
> > >> Pig query 1
> > >> -----------
> > >> L = load 'file' as (query, count); -- assuming that count is
> > >> already there, for simplicity
> > >> O = order L by count; -- order it by the column that defines 'top'
> > >> G = group L all;
> > >> TOTAL_ROWS = foreach G generate COUNT(L);
> > >> store O into 'file.ordered';
> > >>
> > >> -- at present, there is no way to use TOTAL_ROWS directly to filter
> > >> the results
> > >> store TOTAL_ROWS into 'total_rows.txt';
> > >>
> > >> Pig query 2
> > >> -----------
> > >> -- assuming this query is invoked with -param TOP_ROWS=X*0.95,
> > >> where X is the result stored in total_rows.txt
> > >> L = load 'file.ordered';
> > >> LIM = limit L $TOP_ROWS;
> > >> store LIM into 'output';
> > >>
> > >> On 6/29/10 3:24 PM, "Dave Viner" <[email protected]> wrote:
> > >>
> > >>> Actually, I've gotten the first half of the code to work now.
> > >>> Here's how it looks:
> > >>>
> > >>> X = LOAD 'samples/excite-small.log' USING PigStorage('\t') AS
> > >>> (user:chararray, time:long, query:chararray);
> > >>> Z = FILTER X BY query is not null;
> > >>> A = GROUP Z BY query;
> > >>> B = FOREACH A GENERATE group as query:chararray, COUNT(Z) as
> > >>> count:long;
> > >>> C = GROUP B ALL;
> > >>> U = FOREACH C GENERATE MIN(B.count) as min:long, MAX(B.count) as
> > >>> max:long;
> > >>> V = FOREACH U GENERATE min + (max-min)*0.95;
> > >>>
> > >>> V is appropriately set... but I can't perform the final step of
> > >>> actually filtering the values by that count.
> > >>>
> > >>> The simple approach:
> > >>> TOP = FILTER B BY count > V;
> > >>>
> > >>> doesn't work...
> > >>> ERROR 1000: Error during parsing. Invalid alias: V in
> > >>> {query: chararray,count: long}
> > >>>
> > >>> Same with SPLIT.
> > >>>
> > >>> How do I filter or split on the value of V?
> > >>>
> > >>> Dave Viner
> > >>>
> > >>> On Tue, Jun 29, 2010 at 3:00 PM, Dave Viner <[email protected]> wrote:
> > >>>
> > >>>> I don't quite understand this pig latin. The piggybank function
> > >>>> org.apache.pig.piggybank.evaluation.math.MIN takes 2 parameters
> > >>>> which are compared. Here's the sample I'm trying, using the
> > >>>> tutorial excite log as a sample.
> > >>>>
> > >>>> X = LOAD 'samples/excite-small.log' USING PigStorage('\t') AS
> > >>>> (user:chararray, time:long, query:chararray);
> > >>>> Z = FILTER X BY query is not null;
> > >>>> A = GROUP Z BY query;
> > >>>> B = FOREACH A GENERATE group as query:chararray, COUNT(Z) as
> > >>>> count:long;
> > >>>> U = FOREACH B GENERATE *,
> > >>>> MIN(count) as min:long,
> > >>>> MAX(count) as max:long;
> > >>>>
> > >>>> This doesn't seem to work at all. It dies with this error:
> > >>>> ERROR 1022: Type mismatch merging schema prefix. Field Schema:
> > >>>> double. Other Field Schema: min: long
> > >>>>
> > >>>> Changing the min:long and max:long to doubles (as suggested by
> > >>>> the error message) causes this error:
> > >>>> ERROR 1045: Could not infer the matching function for
> > >>>> org.apache.pig.builtin.MIN as multiple or none of them fit.
> > >>>> Please use an explicit cast.
> > >>>>
> > >>>> What am I missing in using the sample code you've provided? I
> > >>>> can't seem to get it to work...
> > >>>>
> > >>>> Thanks for your help.
> > >>>> Dave Viner
> > >>>>
> > >>>> On Tue, Jun 29, 2010 at 10:17 AM, hc busy <[email protected]> wrote:
> > >>>>
> > >>>>> That's what I tried to say in my last email. I don't believe you
> > >>>>> can calculate exactly the percentiles in just one pass.
> > >>>>> Writing out the pig for the two-pass algorithm should be easy
> > >>>>> enough:
> > >>>>>
> > >>>>> P = group TABLE all;
> > >>>>> U = foreach P generate MIN(x) as min, MAX(x) as max;
> > >>>>> V = foreach U generate min + (max-min)*0.95;
> > >>>>>
> > >>>>> would give you the 95th percentile cutoff, and you just filter
> > >>>>> or split by V.
> > >>>>>
> > >>>>> On Tue, Jun 29, 2010 at 10:03 AM, Dave Viner <[email protected]> wrote:
> > >>>>>
> > >>>>>> How would I calculate the percentile in one pass? In order to
> > >>>>>> calculate the percentile for each item, I need to know the
> > >>>>>> total count. How do I get the total count, and then calculate
> > >>>>>> each item's percentile in one pass?
> > >>>>>>
> > >>>>>> I don't mind doing multiple passes - I am just not sure how to
> > >>>>>> make the calculation.
> > >>>>>>
> > >>>>>> Thanks
> > >>>>>> Dave Viner
> > >>>>>>
> > >>>>>> On Tue, Jun 29, 2010 at 9:59 AM, hc busy <[email protected]> wrote:
> > >>>>>>
> > >>>>>>> I think it's impossible to do this within one M/R job. You
> > >>>>>>> will want to implement it in two M/R jobs in Pig, because you
> > >>>>>>> have to calculate the percentile in pass 1, and then perform
> > >>>>>>> the filter in pass 2.
> > >>>>>>>
> > >>>>>>> On Tue, Jun 29, 2010 at 8:14 AM, Dave Viner <[email protected]> wrote:
> > >>>>>>>
> > >>>>>>>> Is there a UDF for generating the top X % of results? For
> > >>>>>>>> example, in a log parsing context, it might be the set of
> > >>>>>>>> search queries that represent the top 80% of all queries.
> > >>>>>>>>
> > >>>>>>>> I see in the piggybank that there is a TOP function, but that
> > >>>>>>>> only takes the top *number* of results, rather than a
> > >>>>>>>> percentile.
> > >>>>>>>>
> > >>>>>>>> Thanks
> > >>>>>>>> Dave Viner
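To make the distinction argued in this thread concrete: min + (max-min)*0.95 is a cutoff on *values*, not a count of rows, so on skewed data it can keep far more (or fewer) than 95% of the rows. A quick sketch (Python, purely illustrative):

```python
# Purely illustrative: the value-cutoff approach from the thread.

def value_cutoff(xs, frac=0.95):
    """Keep rows at or below min + (max - min) * frac (the single-row V)."""
    lo, hi = min(xs), max(xs)
    v = lo + (hi - lo) * frac
    return [x for x in xs if x <= v]

xs = [1, 1, 1, 1, 1, 1, 1, 1, 9, 10]
value_cutoff(xs, 0.5)   # cutoff is 5.5 -> keeps 8 of 10 rows (80%), not 50%
```

A true "top xx% of rows" needs the row count, hence the sort-and-limit (two-pass) approach discussed above.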
