Hi Thejas,

Thanks for this suggestion.  Just to make sure I understand the strategy
here:

Pig query 1 outputs 2 files (or directories w/ files in them):
"file.ordered" which contains the data set (query, count) in ascending order
of count; and "total_rows" which contains 1 line with the total number of
rows.

Then, I need a small script that reads the value of "total_rows", multiplies
that value by the desired percentile (0.95 in the example), and invokes pig
a second time, passing the "TOP_ROWS" value as a parameter (-param) on the
command line.  The second pig query outputs the desired file in "output",
which contains only those rows which have a count above the given threshold.
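For what it's worth, here is a minimal sketch of that in-between driver script. It is shell under stated assumptions: the names total_rows.txt and query2.pig are placeholders I made up, and the total is hard-coded as a sample value (in a real run it would be read from the part files that Pig's store writes into the total_rows.txt directory):

```shell
#!/bin/sh
# Driver between the two Pig runs. Names are illustrative, not from the thread.
PCT=0.95

# 'store TOTAL_ROWS into total_rows.txt' writes a directory of part files;
# the real script would read the count with something like:
#   TOTAL=$(cat total_rows.txt/part-*)
TOTAL=200   # sample value for illustration

# Multiply by the percentile and round to a whole number of rows.
TOP_ROWS=$(awk -v t="$TOTAL" -v p="$PCT" 'BEGIN { printf "%.0f", t * p }')
echo "TOP_ROWS=$TOP_ROWS"

# The real script would then launch the second query:
#   pig -param TOP_ROWS="$TOP_ROWS" query2.pig
```

(awk is used for the multiplication because plain shell arithmetic is integer-only.)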

Is that correct?

Thanks
Dave Viner



On Tue, Jun 29, 2010 at 4:11 PM, Thejas Nair <[email protected]> wrote:

> This pig query will not give you the "top xx% of results". I.e., if xx = 50%
> and results = 200, you may not get 100 rows in the output.
> If you really want the "top xx% of results", you need to do something like
> the following -
> (if you just want rows that have count > min(count) + (max(count) - min(count))*0.95,
> you can store and use the value of V the way TOTAL_ROWS is stored in this
> example)
>
> Pig query 1
> -----------
> L = load 'file' as (query, count); -- assuming that count is already there
> for simplicity
> O = order L by count; -- order it by the column that defines 'top'
> G = group L all;
> TOTAL_ROWS = foreach G generate COUNT(L); -- count the grouped bag L
> store O into 'file.ordered';
>
>  -- at present, there is no way to use the 'TOTAL_ROWS' directly to filter
> the results
> store TOTAL_ROWS into 'total_rows.txt';
>
>
>
> Pig query 2
> -----------
> -- assuming this query is invoked with -param TOP_ROWS=X*0.95, where X is
> the result stored in 'total_rows.txt'
> L = load 'file.ordered';
> LIM = limit L $TOP_ROWS;
> store LIM into 'output';
>
>
>
> On 6/29/10 3:24 PM, "Dave Viner" <[email protected]> wrote:
>
> > Actually, I've gotten the first half of the code to work now.  Here's how
> > it looks:
> >
> >
> > X = LOAD 'samples/excite-small.log' USING PigStorage('\t') AS
> > (user:chararray, time:long, query:chararray);
> > Z = FILTER X BY query is not null;
> > A = GROUP Z BY query;
> > B = FOREACH A GENERATE group as query:chararray, COUNT(Z) as count:long;
> > C = GROUP B ALL;
> > U = FOREACH C GENERATE MIN(B.count) as min:long, MAX(B.count) as max:long;
> > V = FOREACH U GENERATE min + (max-min)*0.95;
> >
> > V is appropriately set... but I can't perform the final step of actually
> > filtering the values by that count.
> >
> > The simple approach:
> > TOP = FILTER B BY count > V;
> >
> > doesn't work... ERROR 1000: Error during parsing. Invalid alias: V in
> > {query: chararray,count: long}
> >
> > Same with SPLIT.
> >
> > How do I filter or split on the value of V?
> >
> > Dave Viner
> >
> > On Tue, Jun 29, 2010 at 3:00 PM, Dave Viner <[email protected]> wrote:
> >
> >> I don't quite understand this pig latin.  The piggybank
> >> function org.apache.pig.piggybank.evaluation.math.MIN takes 2 parameters
> >> which are compared.  Here's the sample I'm trying using the tutorial
> >> excitelog as a sample.
> >>
> >> X = LOAD 'samples/excite-small.log' USING PigStorage('\t') AS
> >> (user:chararray, time:long, query:chararray);
> >> Z = FILTER X BY query is not null;
> >> A = GROUP Z BY query;
> >> B = FOREACH A GENERATE group as query:chararray, COUNT(Z) as count:long;
> >> U = FOREACH B GENERATE *,
> >>     MIN(count) as min:long,
> >>     MAX(count) as max:long;
> >>
> >> This doesn't seem to work at all.  It dies with this error:
> >> ERROR 1022: Type mismatch merging schema prefix. Field Schema: double.
> >> Other Field Schema: min: long
> >>
> >> Changing the min:long and max:long to doubles (as suggested by the error
> >> message), causes this error:
> >> ERROR 1045: Could not infer the matching function for
> >> org.apache.pig.builtin.MIN as multiple or none of them fit. Please use an
> >> explicit cast.
> >>
> >> What am I missing in using the sample code you've provided?  I can't seem
> >> to get it to work...
> >>
> >> Thanks for your help.
> >> Dave Viner
> >>
> >>
> >> On Tue, Jun 29, 2010 at 10:17 AM, hc busy <[email protected]> wrote:
> >>
> >>> That's what I tried to say in my last email. I don't believe you can
> >>> calculate the percentiles exactly in just one pass. Writing out the pig
> >>> for a two-pass algorithm should be easy enough.
> >>>
> >>> P = group TABLE all;
> >>> U = foreach P generate MIN(x) as min, MAX(x) as max;
> >>> V = foreach U generate min + (max-min)*0.95;
> >>>
> >>> would give you the 95th percentile cutoff, and you just filter or split
> >>> by V.
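[For concreteness, the cutoff arithmetic in V above works out like this; the min/max values here are made-up samples, not from the thread. As noted later in the thread, this is a cutoff on the value range, not a guarantee of a fixed fraction of rows:]

```shell
# 95th-percentile cutoff on the value range, per V above: min + (max - min) * 0.95
# MIN and MAX are illustrative sample values.
MIN=1
MAX=101
CUTOFF=$(awk -v mn="$MIN" -v mx="$MAX" 'BEGIN { print mn + (mx - mn) * 0.95 }')
echo "$CUTOFF"   # 96
```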
> >>>
> >>>
> >>> On Tue, Jun 29, 2010 at 10:03 AM, Dave Viner <[email protected]> wrote:
> >>>
> >>>> How would I calculate the percentile in one pass?  In order to calculate
> >>>> the percentile for each item, I need to know the total count.  How do I
> >>>> get the total count, and then calculate each item's percentile in one
> >>>> pass?
> >>>>
> >>>> I don't mind doing multiple passes - I am just not sure how to make the
> >>>> calculation.
> >>>>
> >>>> Thanks
> >>>> Dave Viner
> >>>>
> >>>>
> >>>> On Tue, Jun 29, 2010 at 9:59 AM, hc busy <[email protected]> wrote:
> >>>>
> >>>>> I think it's impossible to do this within one M/R. You will want to
> >>>>> implement it in two M/R in Pig, because you have to calculate the
> >>>>> percentile
> >>>>> in pass 1, and then perform the filter in pass 2.
> >>>>>
> >>>>>
> >>>>> On Tue, Jun 29, 2010 at 8:14 AM, Dave Viner <[email protected]>
> >>>>> wrote:
> >>>>>
> >>>>>> Is there a UDF for generating the top X% of results?  For example, in
> >>>>>> a log parsing context, it might be the set of search queries that
> >>>>>> represent the top 80% of all queries.
> >>>>>>
> >>>>>> I see in the piggybank that there is a TOP function, but that only
> >>>>>> takes the top *number* of results, rather than a percentile.
> >>>>>>
> >>>>>> Thanks
> >>>>>> Dave Viner
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> >>
>
>
