Actually, I've gotten the first half of the code to work now. Here's how
it looks:


X = LOAD 'samples/excite-small.log' USING PigStorage('\t') AS
(user:chararray, time:long, query:chararray);
Z = FILTER X BY query is not null;
A = GROUP Z BY query;
B = FOREACH A GENERATE group as query:chararray, COUNT(Z) as count:long;
C = GROUP B ALL;
U = FOREACH C GENERATE MIN(B.count) as min:long, MAX(B.count) as max:long;
V = FOREACH U GENERATE min + (max-min)*0.95;

V is computed correctly... but I can't perform the final step of actually
filtering the rows by that cutoff value.

The simple approach:
TOP = FILTER B BY count > V;

doesn't work... ERROR 1000: Error during parsing. Invalid alias: V in
{query: chararray,count: long}

Same with SPLIT.

How do I filter or split on the value of V?
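My best guess at a workaround (untested): since V is a single-row relation,
maybe a CROSS can replicate the cutoff onto every row of B so the comparison
becomes column-vs-column instead of column-vs-relation:

W = CROSS B, V;                  -- attaches V's one row to every row of B
TOP = FILTER W BY count > $2;    -- assuming $2 is the cutoff column from V

But I'm not sure that's the intended approach.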

Dave Viner

On Tue, Jun 29, 2010 at 3:00 PM, Dave Viner <[email protected]> wrote:

> I don't quite understand this pig latin.  The piggybank
> function org.apache.pig.piggybank.evaluation.math.MIN takes 2 parameters
> which are compared.  Here's the sample I'm trying using the tutorial
> excitelog as a sample.
>
> X = LOAD 'samples/excite-small.log' USING PigStorage('\t') AS
> (user:chararray, time:long, query:chararray);
> Z = FILTER X BY query is not null;
> A = GROUP Z BY query;
> B = FOREACH A GENERATE group as query:chararray, COUNT(Z) as count:long;
> U = FOREACH B GENERATE *,
>     MIN(count) as min:long,
>     MAX(count) as max:long;
>
> This doesn't seem to work at all.  It dies with this error:
> ERROR 1022: Type mismatch merging schema prefix. Field Schema: double.
> Other Field Schema: min: long
>
> Changing the min:long and max:long to doubles (as suggested by the error
> message), causes this error:
> ERROR 1045: Could not infer the matching function for
> org.apache.pig.builtin.MIN as multiple or none of them fit. Please use an
> explicit cast.
>
> What am I missing in using the sample code you've provided?  I can't seem
> to get it to work...
>
> Thanks for your help.
> Dave Viner
>
>
> On Tue, Jun 29, 2010 at 10:17 AM, hc busy <[email protected]> wrote:
>
>> That's what I tried to say in my last email. I don't believe you can
>> calculate the percentiles exactly in just one pass. Writing out the Pig
>> for a two-pass algorithm should be easy enough:
>>
>> P = GROUP TABLE ALL;
>> U = FOREACH P GENERATE MIN(TABLE.x) AS min, MAX(TABLE.x) AS max;
>> V = FOREACH U GENERATE min + (max - min) * 0.95;
>>
>> would give you the 95th-percentile cutoff (strictly, the point at 95% of
>> the [min, max] range), and you just filter or split by V.
>>
>>
>> On Tue, Jun 29, 2010 at 10:03 AM, Dave Viner <[email protected]> wrote:
>>
>> > How would I calculate the percentile in one pass?  In order to calculate
>> > the
>> > percentile for each item, I need to know the total count.  How do I get
>> the
>> > total count, and then calculate each item's percentile in one pass?
>> >
>> > I don't mind doing multiple passes - I am just not sure how to make the
>> > calculation.
>> >
>> > Thanks
>> > Dave Viner
>> >
>> >
>> > On Tue, Jun 29, 2010 at 9:59 AM, hc busy <[email protected]> wrote:
>> >
>> > > I think it's impossible to do this within one M/R. You will want to
>> > > implement it in two M/R in Pig, because you have to calculate the
>> > > percentile
>> > > in pass 1, and then perform the filter in pass 2.
>> > >
>> > >
>> > > On Tue, Jun 29, 2010 at 8:14 AM, Dave Viner <[email protected]>
>> wrote:
>> > >
>> > > > Is there a UDF for generating the top X % of results?  For example,
>> in
>> > a
>> > > > log
>> > > > parsing context, it might be the set of search queries that
>> represent
>> > the
>> > > > top 80% of all queries.
>> > > >
>> > > > I see in the piggybank that there is a TOP function, but that only
>> > takes
>> > > > the
>> > > > top *number* of results, rather than a percentile.
>> > > >
>> > > > Thanks
>> > > > Dave Viner
>> > > >
>> > >
>> >
>>
>
>
