On 6/29/10 4:22 PM, "Dave Viner" <[email protected]> wrote:
> Hi Thejas,
>
> Thanks for this suggestion. Just to make sure I understand the strategy
> here:
>
> Pig query 1 outputs 2 files (or directories w/ files in them):
> "file.ordered" which contains the data set (query, count) in ascending order
> of count; and "total_rows" which contains 1 line with the total number of
> rows.
>
> Then, I need a small script that reads the value of "total_rows", multiplies
> that value by the desired percentile (0.95 in the example), and invokes pig
> a second time, passing the "TOP_ROWS" value as a define on the command line.
> The second pig query outputs the desired file in "output" which contains
> only those rows which have a count above the given threshold.
>
Yes, you are right. But "output" contains 0.95 * total_num_rows rows; it is
not all rows that have a count value above a given threshold. So it is
possible that a few rows that are not in "output" have the same count value
as the last row in "output".
For example, if the input has 100 rows with a count value of 1, any 95 of
those rows will appear in the output.
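
As a plain-Python sketch of that LIMIT behavior (made-up data; this is an
illustration, not part of the Pig queries):

```python
import math

def limit_top_pct(rows, pct=0.95):
    """Mimic query 2: sort by count, then keep floor(pct * N) rows.
    Rows tied at the cutoff are kept or dropped arbitrarily, as LIMIT does."""
    n = int(math.floor(len(rows) * pct))
    return sorted(rows, key=lambda r: r[1])[:n]

# 100 rows that all have count 1: exactly 95 survive, chosen arbitrarily.
rows = [("q%d" % i, 1) for i in range(100)]
print(len(limit_top_pct(rows)))  # 95
```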
-Thejas
> Is that correct?
>
> Thanks
> Dave Viner
>
>
>
> On Tue, Jun 29, 2010 at 4:11 PM, Thejas Nair <[email protected]> wrote:
>
>> This Pig query will not give you the "top xx% of results". I.e., if xx = 50%
>> and there are 200 results, you may not get exactly 100 rows in the output.
>> If you really want "top xx% of results", you need to do something like the
>> following -
>> (if you just want rows that have count < min(count) + (max(count) -
>> min(count))*0.95, you can store and use the value of V the way TOTAL_ROWS
>> is stored in this example)
>>
>> Pig query 1
>> -----------
>> L = load 'file' as (query, count); -- assume count is already present, for simplicity
>> O = order L by count; -- order by the column that defines 'top'
>> G = group L by 'all';
>> TOTAL_ROWS = foreach G generate COUNT(L);
>> store O into 'file.ordered';
>>
>> -- at present, there is no way to use the 'TOTAL_ROWS' directly to filter
>> the results
>> store TOTAL_ROWS into 'total_rows.txt';
>>
>>
>>
>> Pig query 2
>> -----------
>> -- assuming this query is invoked with -param TOP_ROWS=X*0.95, where X is
>> the result stored in total_rows.txt
>> L = load 'file.ordered';
>> LIM = limit L $TOP_ROWS;
>> store LIM into 'output';
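
The driver between the two queries could be as small as this (a sketch;
the script name `query2.pig` and the floor rounding are assumptions):

```python
import math

def build_pig_command(total_rows, pct=0.95, script="query2.pig"):
    # Compute TOP_ROWS = floor(pct * total) and build the second pig
    # invocation; run the result with e.g. subprocess.check_call(...).
    top_rows = int(math.floor(total_rows * pct))
    return ["pig", "-param", "TOP_ROWS=%d" % top_rows, script]

# If total_rows.txt contained 200, the second run keeps 190 rows:
print(build_pig_command(200))  # ['pig', '-param', 'TOP_ROWS=190', 'query2.pig']
```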
>>
>>
>>
>> On 6/29/10 3:24 PM, "Dave Viner" <[email protected]> wrote:
>>
>>> Actually, I've gotten the first half of the code to work now. Here's how
>>> it looks:
>>>
>>>
>>> X = LOAD 'samples/excite-small.log' USING PigStorage('\t') AS
>>> (user:chararray, time:long, query:chararray);
>>> Z = FILTER X BY query is not null;
>>> A = GROUP Z BY query;
>>> B = FOREACH A GENERATE group as query:chararray, COUNT(Z) as count:long;
>>> C = GROUP B ALL;
>>> U = FOREACH C GENERATE MIN(B.count) as min:long, MAX(B.count) as max:long;
>>> V = FOREACH U GENERATE min + (max-min)*0.95;
>>>
>>> V is appropriately set... but I can't perform the final step of actually
>>> filtering the values by that count.
>>>
>>> The simple approach:
>>> TOP = FILTER B BY count > V;
>>>
>>> doesn't work... ERROR 1000: Error during parsing. Invalid alias: V in
>>> {query: chararray,count: long}
>>>
>>> Same with SPLIT.
>>>
>>> How do I filter or split on the value of V?
>>>
>>> Dave Viner
>>>
>>> On Tue, Jun 29, 2010 at 3:00 PM, Dave Viner <[email protected]> wrote:
>>>
>>>> I don't quite understand this pig latin. The piggybank
>>>> function org.apache.pig.piggybank.evaluation.math.MIN takes 2 parameters
>>>> which are compared. Here's the sample I'm trying using the tutorial
>>>> excitelog as a sample.
>>>>
>>>> X = LOAD 'samples/excite-small.log' USING PigStorage('\t') AS
>>>> (user:chararray, time:long, query:chararray);
>>>> Z = FILTER X BY query is not null;
>>>> A = GROUP Z BY query;
>>>> B = FOREACH A GENERATE group as query:chararray, COUNT(Z) as count:long;
>>>> U = FOREACH B GENERATE *,
>>>> MIN(count) as min:long,
>>>> MAX(count) as max:long;
>>>>
>>>> This doesn't seem to work at all. It dies with this error:
>>>> ERROR 1022: Type mismatch merging schema prefix. Field Schema: double.
>>>> Other Field Schema: min: long
>>>>
>>>> Changing the min:long and max:long to doubles (as suggested by the error
>>>> message), causes this error:
>>>> ERROR 1045: Could not infer the matching function for
>>>> org.apache.pig.builtin.MIN as multiple or none of them fit. Please use an
>>>> explicit cast.
>>>>
>>>> What am I missing in using the sample code you've provided? I can't seem
>>>> to get it to work...
>>>>
>>>> Thanks for your help.
>>>> Dave Viner
>>>>
>>>>
>>>> On Tue, Jun 29, 2010 at 10:17 AM, hc busy <[email protected]> wrote:
>>>>
>>>>> That's what I tried to say in my last email. I don't believe you can
>>>>> calculate the percentiles exactly in just one pass. Writing out the Pig
>>>>> for a two-pass algorithm should be easy enough.
>>>>>
>>>>> P = group TABLE all;
>>>>> U = foreach P generate MIN(TABLE.x) as min, MAX(TABLE.x) as max;
>>>>> V = foreach U generate min + (max-min)*0.95;
>>>>>
>>>>> would give you the 95th percentile cutoff, and you just filter or split
>>>>> by V.
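
In plain Python, that cutoff looks like this (made-up counts; note that it is
a threshold on the value range, not a rank-based percentile, which is the
distinction Thejas draws in his reply):

```python
def value_cutoff(xs, pct=0.95):
    """min + (max - min) * pct: a threshold on the value range,
    not a rank-based percentile."""
    lo, hi = min(xs), max(xs)
    return lo + (hi - lo) * pct

counts = [1, 2, 2, 3, 10, 40, 100]
cut = value_cutoff(counts)               # 1 + (100 - 1) * 0.95 = 95.05
print([x for x in counts if x > cut])    # [100]
```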
>>>>>
>>>>>
>>>>> On Tue, Jun 29, 2010 at 10:03 AM, Dave Viner <[email protected]> wrote:
>>>>>
>>>>>> How would I calculate the percentile in one pass? In order to calculate
>>>>>> the percentile for each item, I need to know the total count. How do I
>>>>>> get the total count, and then calculate each item's percentile in one
>>>>>> pass?
>>>>>>
>>>>>> I don't mind doing multiple passes - I am just not sure how to make the
>>>>>> calculation.
>>>>>>
>>>>>> Thanks
>>>>>> Dave Viner
>>>>>>
>>>>>>
>>>>>> On Tue, Jun 29, 2010 at 9:59 AM, hc busy <[email protected]> wrote:
>>>>>>
>>>>>>> I think it's impossible to do this within one M/R. You will want to
>>>>>>> implement it in two M/R in Pig, because you have to calculate the
>>>>>>> percentile
>>>>>>> in pass 1, and then perform the filter in pass 2.
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Jun 29, 2010 at 8:14 AM, Dave Viner <[email protected]> wrote:
>>>>>>>
>>>>>>>> Is there a UDF for generating the top X% of results? For example, in
>>>>>>>> a log parsing context, it might be the set of search queries that
>>>>>>>> represent the top 80% of all queries.
>>>>>>>>
>>>>>>>> I see in the piggybank that there is a TOP function, but that only
>>>>>>>> takes the top *number* of results, rather than a percentile.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Dave Viner
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>
>>