On 6/29/10 4:22 PM, "Dave Viner" <[email protected]> wrote:

> Hi Thejas,
> 
> Thanks for this suggestion.  Just to make sure I understand the strategy
> here:
> 
> Pig query 1 outputs 2 files (or directories w/ files in them):
> "file.ordered" which contains the data set (query, count) in ascending order
> of count; and "total_rows" which contains 1 line with the total number of
> rows.
> 
> Then, I need a small script that reads the value of "total_rows", multiplies
> that value by the desired percentile (0.95 in the example), and invokes pig
> a second time, passing the "TOP_ROWS" value as a define on the command line.
>  The second pig query outputs the desired file in "output" which contains
> only those rows which have a count above the given threshold.
> 


Yes, you are right. But note that "output" contains 0.95 * total_num_rows
rows; it is not all rows that have a count above a given threshold. So it is
possible that a few rows that are not in "output" have the same count value
as the last row in "output".
For example, if the input has 100 rows all with a count of 1, you will get
an arbitrary 95 of them in the output.
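A quick way to see this tie behavior (a Python sketch, not Pig; the data is
made up):

```python
# 100 queries that all have the same count of 1.
rows = [("query_%d" % i, 1) for i in range(100)]

# ORDER + LIMIT keeps the first 95 rows of the sorted data, but since
# every row ties on count, which 95 survive is arbitrary.
ordered = sorted(rows, key=lambda r: r[1])
top = ordered[:95]
dropped = ordered[95:]

print(len(top))                                   # 95
# Every dropped row has the same count as the last kept row.
print(all(r[1] == top[-1][1] for r in dropped))   # True
```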

-Thejas



> Is that correct?
> 
> Thanks
> Dave Viner
> 
> 
> 
> On Tue, Jun 29, 2010 at 4:11 PM, Thejas Nair <[email protected]> wrote:
> 
>> This Pig query will not give you the "top xx% of results". I.e., if xx =
>> 50% and there are 200 results, you may not get 100 rows in the output.
>> If you really want "top xx% of results", you need to do something like the
>> following -
>> (if you just want rows that have count > min(count) + (max(count) -
>> min(count))*0.95, you can store and use the value of V the way TOTAL_ROWS
>> is stored in this example)
>> 
>> Pig query 1
>> -----------
>> L = load 'file' as (query, count); -- assuming that count is already there
>> for simplicity
>> O = order L by count; -- order it by column that defines 'top'
>> G = group L by 'all';
>> TOTAL_ROWS = foreach G generate COUNT(L);
>> store O into 'file.ordered';
>> 
>>  -- at present, there is no way to use the 'TOTAL_ROWS' directly to filter
>> the results
>> store TOTAL_ROWS into 'total_rows.txt';
>> 
>> 
>> 
>> Pig query 2
>> -----------
>> -- assuming this query is invoked with -param TOP_ROWS=X*0.95, where X is
>> the result stored in  total_rows.txt
>> L = load 'file.ordered';
>> LIM = limit L $TOP_ROWS;
>> store LIM into 'output';
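The glue script between the two queries could look something like this
Python sketch (the single-line contents of total_rows.txt, the script name
query2.pig, and the exact pig invocation are assumptions about your setup,
not something from this thread):

```python
# Hypothetical driver: read the row count produced by query 1,
# compute the LIMIT value, and build the second pig invocation.

def read_total_rows(text):
    # total_rows.txt holds a single line with the count, e.g. "200"
    return int(text.strip())

def build_pig_command(total_rows, fraction=0.95):
    top_rows = int(total_rows * fraction)
    # script name "query2.pig" is assumed
    return ["pig", "-param", "TOP_ROWS=%d" % top_rows, "query2.pig"]

cmd = build_pig_command(read_total_rows("200\n"))
print(" ".join(cmd))   # pig -param TOP_ROWS=190 query2.pig
```

In practice the stored result is a directory of part files, so you would
cat the part file under total_rows.txt first, then run the command with
subprocess.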
>> 
>> 
>> 
>> On 6/29/10 3:24 PM, "Dave Viner" <[email protected]> wrote:
>> 
>>> Actually, I've gotten the first half of the code to work now.  Here's
>>> how it looks:
>>> 
>>> 
>>> X = LOAD 'samples/excite-small.log' USING PigStorage('\t') AS
>>> (user:chararray, time:long, query:chararray);
>>> Z = FILTER X BY query is not null;
>>> A = GROUP Z BY query;
>>> B = FOREACH A GENERATE group as query:chararray, COUNT(Z) as count:long;
>>> C = GROUP B ALL;
>>> U = FOREACH C GENERATE MIN(B.count) as min:long, MAX(B.count) as
>>> max:long;
>>> V = FOREACH U GENERATE min + (max-min)*0.95;
>>> 
>>> V is appropriately set... but I can't perform the final step of actually
>>> filtering the values by that count.
>>> 
>>> The simple approach:
>>> TOP = FILTER B BY count > V;
>>> 
>>> doesn't work... ERROR 1000: Error during parsing. Invalid alias: V in
>>> {query: chararray,count: long}
>>> 
>>> Same with SPLIT.
>>> 
>>> How do I filter or split on the value of V?
>>> 
>>> Dave Viner
>>> 
>>> On Tue, Jun 29, 2010 at 3:00 PM, Dave Viner <[email protected]> wrote:
>>> 
>>>> I don't quite understand this pig latin.  The piggybank
>>>> function org.apache.pig.piggybank.evaluation.math.MIN takes 2 parameters
>>>> which are compared.  Here's what I'm trying, using the tutorial excite
>>>> log as sample data.
>>>> 
>>>> X = LOAD 'samples/excite-small.log' USING PigStorage('\t') AS
>>>> (user:chararray, time:long, query:chararray);
>>>> Z = FILTER X BY query is not null;
>>>> A = GROUP Z BY query;
>>>> B = FOREACH A GENERATE group as query:chararray, COUNT(Z) as count:long;
>>>> U = FOREACH B GENERATE *,
>>>>     MIN(count) as min:long,
>>>>     MAX(count) as max:long;
>>>> 
>>>> This doesn't seem to work at all.  It dies with this error:
>>>> ERROR 1022: Type mismatch merging schema prefix. Field Schema: double.
>>>> Other Field Schema: min: long
>>>> 
>>>> Changing the min:long and max:long to doubles (as suggested by the error
>>>> message), causes this error:
>>>> ERROR 1045: Could not infer the matching function for
>>>> org.apache.pig.builtin.MIN as multiple or none of them fit. Please use
>>>> an explicit cast.
>>>> 
>>>> What am I missing in using the sample code you've provided?  I can't
>>>> seem to get it to work...
>>>> 
>>>> Thanks for your help.
>>>> Dave Viner
>>>> 
>>>> 
>>>> On Tue, Jun 29, 2010 at 10:17 AM, hc busy <[email protected]> wrote:
>>>> 
>>>>> That's what I tried to say in my last email. I don't believe you can
>>>>> calculate the percentiles exactly in just one pass. Writing out the
>>>>> Pig for the two-pass algorithm should be easy enough:
>>>>> 
>>>>> P = group TABLE all;
>>>>> U = foreach P generate MIN(TABLE.x) as min, MAX(TABLE.x) as max;
>>>>> V = foreach U generate min + (max-min)*0.95;
>>>>> 
>>>>> would give you the 95th percentile cutoff, and you just filter or
>>>>> split by V.
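Note that this V is a cutoff on the range of count values, not a rank-based
percentile, so the fraction of rows it keeps depends on how the counts are
distributed. In Python terms (made-up counts):

```python
# Skewed counts: one small value, the rest clustered near the max.
counts = [1, 96, 97, 98, 99, 100]

lo, hi = min(counts), max(counts)
cutoff = lo + (hi - lo) * 0.95      # 1 + 99 * 0.95 = 95.05
top = [c for c in counts if c > cutoff]

print(top)   # [96, 97, 98, 99, 100] -- 5 of 6 rows pass a "95%" cutoff
```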
>>>>> 
>>>>> 
>>>>> On Tue, Jun 29, 2010 at 10:03 AM, Dave Viner <[email protected]> wrote:
>>>>> 
>>>>>> How would I calculate the percentile in one pass?  In order to
>>>>>> calculate the percentile for each item, I need to know the total
>>>>>> count.  How do I get the total count, and then calculate each item's
>>>>>> percentile in one pass?
>>>>>> 
>>>>>> I don't mind doing multiple passes - I am just not sure how to make
>>>>>> the calculation.
>>>>>> 
>>>>>> Thanks
>>>>>> Dave Viner
>>>>>> 
>>>>>> 
>>>>>> On Tue, Jun 29, 2010 at 9:59 AM, hc busy <[email protected]> wrote:
>>>>>> 
>>>>>>> I think it's impossible to do this within one M/R job. You will want
>>>>>>> to implement it as two M/R jobs in Pig, because you have to calculate
>>>>>>> the percentile in pass 1, and then perform the filter in pass 2.
>>>>>>> 
>>>>>>> 
>>>>>>> On Tue, Jun 29, 2010 at 8:14 AM, Dave Viner <[email protected]>
>>>>> wrote:
>>>>>>> 
>>>>>>>> Is there a UDF for generating the top X % of results?  For example,
>>>>>>>> in a log parsing context, it might be the set of search queries that
>>>>>>>> represent the top 80% of all queries.
>>>>>>>> 
>>>>>>>> I see in the piggybank that there is a TOP function, but that only
>>>>>>>> takes the top *number* of results, rather than a percentile.
>>>>>>>> 
>>>>>>>> Thanks
>>>>>>>> Dave Viner
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>>> 
>> 
>> 
