Re: UDF for generating top xx % of results?

Thejas Nair Tue, 29 Jun 2010 16:14:40 -0700

This pig query will not give you the "top xx% of results" . Ie if xx = 50%
and results = 200, you may not get 100 rows in output.
If you really want "top xx% of results", you need to do something like the
following -
(if you just want rows that have "count < (max(count) - min(count))*0.95,
you can store and use the value of V the way TOTAL_ROWS is stored in this
example)


Pig query 1
-----------
L = load 'file' as (query, count); -- assuming that count is already there
for simplicity 
O = order L by count; -- order it by column that defines 'top'
G = group L by 'all';
TOTAL_ROWS = foreach  G generate COUNT(G);
store O into 'file.ordered';

 -- at present, there is no way to use the 'TOTAL_ROWS' directly to filter
the results 
store TOTAL_ROWS into 'total_rows.txt';



Pig query 2
-----------
-- assuming this query is invoked with -param TOP_ROWS=X*0.95, where X is
the result stored in  total_rows.txt
L = load 'file.ordered';
LIM = limit L, $TOP_ROWS;
Store LIM into 'output';



On 6/29/10 3:24 PM, "Dave Viner" <[email protected]> wrote:

> Actually, I've gotten the first half of the code to work now.   Here's how
> it looks:
> 
> 
> X = LOAD 'samples/excite-small.log' USING PigStorage('\t') AS
> (user:chararray, time:long, query:chararray);
> Z = FILTER X BY query is not null;
> A = GROUP Z BY query;
> B = FOREACH A GENERATE group as query:chararray, COUNT(Z) as count:long;
> C = GROUP B ALL;
> U = FOREACH C GENERATE MIN(B.count) as min:long, MAX(B.count) as max:long;
> V = FOREACH U GENERATE min + (max-min)*0.95;
> 
> V is appropriately set... but I can't perform the final step of actually
> filtering the values by that count.
> 
> The simple approach:
> TOP = FILTER B BY count > V;
> 
> doesn't work... ERROR 1000: Error during parsing. Invalid alias: V in
> {query: chararray,count: long}
> 
> Same with SPLIT.
> 
> How do I filter or split on the value of V?
> 
> Dave Viner
> 
> On Tue, Jun 29, 2010 at 3:00 PM, Dave Viner <[email protected]> wrote:
> 
>> I don't quite understand this pig latin.  The piggybank
>> function org.apache.pig.piggybank.evaluation.math.MIN takes 2 parameters
>> which are compared.  Here's the sample I'm trying using the tutorial
>> excitelog as a sample.
>> 
>> X = LOAD 'samples/excite-small.log' USING PigStorage('\t') AS
>> (user:chararray, time:long, query:chararray);
>> Z = FILTER X BY query is not null;
>> A = GROUP Z BY query;
>> B = FOREACH A GENERATE group as query:chararray, COUNT(Z) as count:long;
>> U = FOREACH B GENERATE *,
>>     MIN(count) as min:long,
>>     MAX(count) as max:long;
>> 
>> This doesn't seem to work at all.  It dies with this error:
>> ERROR 1022: Type mismatch merging schema prefix. Field Schema: double.
>> Other Field Schema: min: long
>> 
>> Changing the min:long and max:long to doubles (as suggested by the error
>> message), causes this error:
>> ERROR 1045: Could not infer the matching function for
>> org.apache.pig.builtin.MIN as multiple or none of them fit. Please use an
>> explicit cast.
>> 
>> What am I missing in using the sample code you've provided?  I can't seem
>> to get it to work...
>> 
>> Thanks for your help.
>> Dave Viner
>> 
>> 
>> On Tue, Jun 29, 2010 at 10:17 AM, hc busy <[email protected]> wrote:
>> 
>>> That's what I tried to say in my last email.I don't believe you can
>>> calculate exactly the percentiles in just one pass. Writing out the pig
>>> for
>>> two pass algorithm should be easy enough..
>>> 
>>> P = group TABLE all;
>>> U = foreach P generate MIN(x) as min, MAX(x) as max;
>>> V = foreach U generate min + (max-min)*0.95;
>>> 
>>> would give you the 95th percentile cutoff, and u just filter or split by
>>> V.
>>> 
>>> 
>>> On Tue, Jun 29, 2010 at 10:03 AM, Dave Viner <[email protected]> wrote:
>>> 
>>>> How would I calculate the percentile in one pass?  In order to calculate
>>>> the
>>>> percentile for each item, I need to know the total count.  How do I get
>>> the
>>>> total count, and then calculate each item's percentile in one pass?
>>>> 
>>>> I don't mind doing multiple passes - I am just not sure how to make the
>>>> calculation.
>>>> 
>>>> Thanks
>>>> Dave Viner
>>>> 
>>>> 
>>>> On Tue, Jun 29, 2010 at 9:59 AM, hc busy <[email protected]> wrote:
>>>> 
>>>>> I think it's impossible to do this within one M/R. You will want to
>>>>> implement it in two M/R in Pig, because you have to calculate the
>>>>> percentile
>>>>> in pass 1, and then perform the filter in pass 2.
>>>>> 
>>>>> 
>>>>> On Tue, Jun 29, 2010 at 8:14 AM, Dave Viner <[email protected]>
>>> wrote:
>>>>> 
>>>>>> Is there a UDF for generating the top X % of results?  For example,
>>> in
>>>> a
>>>>>> log
>>>>>> parsing context, it might be the set of search queries that
>>> represent
>>>> the
>>>>>> top 80% of all queries.
>>>>>> 
>>>>>> I see in the piggybank that there is a TOP function, but that only
>>>> takes
>>>>>> the
>>>>>> top *number* of results, rather a percentile.
>>>>>> 
>>>>>> Thanks
>>>>>> Dave Viner
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
>>

Re: UDF for generating top xx % of results?

Reply via email to