This pig query will not give you the "top xx% of results" . Ie if xx = 50% and results = 200, you may not get 100 rows in output. If you really want "top xx% of results", you need to do something like the following - (if you just want rows that have "count < (max(count) - min(count))*0.95, you can store and use the value of V the way TOTAL_ROWS is stored in this example)
Pig query 1 ----------- L = load 'file' as (query, count); -- assuming that count is already there for simplicity O = order L by count; -- order it by column that defines 'top' G = group L by 'all'; TOTAL_ROWS = foreach G generate COUNT(G); store O into 'file.ordered'; -- at present, there is no way to use the 'TOTAL_ROWS' directly to filter the results store TOTAL_ROWS into 'total_rows.txt'; Pig query 2 ----------- -- assuming this query is invoked with -param TOP_ROWS=X*0.95, where X is the result stored in total_rows.txt L = load 'file.ordered'; LIM = limit L, $TOP_ROWS; Store LIM into 'output'; On 6/29/10 3:24 PM, "Dave Viner" <[email protected]> wrote: > Actually, I've gotten the first half of the code to work now. Here's how > it looks: > > > X = LOAD 'samples/excite-small.log' USING PigStorage('\t') AS > (user:chararray, time:long, query:chararray); > Z = FILTER X BY query is not null; > A = GROUP Z BY query; > B = FOREACH A GENERATE group as query:chararray, COUNT(Z) as count:long; > C = GROUP B ALL; > U = FOREACH C GENERATE MIN(B.count) as min:long, MAX(B.count) as max:long; > V = FOREACH U GENERATE min + (max-min)*0.95; > > V is appropriately set... but I can't perform the final step of actually > filtering the values by that count. > > The simple approach: > TOP = FILTER B BY count > V; > > doesn't work... ERROR 1000: Error during parsing. Invalid alias: V in > {query: chararray,count: long} > > Same with SPLIT. > > How do I filter or split on the value of V? > > Dave Viner > > On Tue, Jun 29, 2010 at 3:00 PM, Dave Viner <[email protected]> wrote: > >> I don't quite understand this pig latin. The piggybank >> function org.apache.pig.piggybank.evaluation.math.MIN takes 2 parameters >> which are compared. Here's the sample I'm trying using the tutorial >> excitelog as a sample. >> >> X = LOAD 'samples/excite-small.log' USING PigStorage('\t') AS >> (user:chararray, time:long, query:chararray); >> Z = FILTER X BY query is not null; >> A = GROUP Z BY query; >> B = FOREACH A GENERATE group as query:chararray, COUNT(Z) as count:long; >> U = FOREACH B GENERATE *, >> MIN(count) as min:long, >> MAX(count) as max:long; >> >> This doesn't seem to work at all. It dies with this error: >> ERROR 1022: Type mismatch merging schema prefix. Field Schema: double. >> Other Field Schema: min: long >> >> Changing the min:long and max:long to doubles (as suggested by the error >> message), causes this error: >> ERROR 1045: Could not infer the matching function for >> org.apache.pig.builtin.MIN as multiple or none of them fit. Please use an >> explicit cast. >> >> What am I missing in using the sample code you've provided? I can't seem >> to get it to work... >> >> Thanks for your help. >> Dave Viner >> >> >> On Tue, Jun 29, 2010 at 10:17 AM, hc busy <[email protected]> wrote: >> >>> That's what I tried to say in my last email.I don't believe you can >>> calculate exactly the percentiles in just one pass. Writing out the pig >>> for >>> two pass algorithm should be easy enough.. >>> >>> P = group TABLE all; >>> U = foreach P generate MIN(x) as min, MAX(x) as max; >>> V = foreach U generate min + (max-min)*0.95; >>> >>> would give you the 95th percentile cutoff, and u just filter or split by >>> V. >>> >>> >>> On Tue, Jun 29, 2010 at 10:03 AM, Dave Viner <[email protected]> wrote: >>> >>>> How would I calculate the percentile in one pass? In order to calculate >>>> the >>>> percentile for each item, I need to know the total count. How do I get >>> the >>>> total count, and then calculate each item's percentile in one pass? >>>> >>>> I don't mind doing multiple passes - I am just not sure how to make the >>>> calculation. >>>> >>>> Thanks >>>> Dave Viner >>>> >>>> >>>> On Tue, Jun 29, 2010 at 9:59 AM, hc busy <[email protected]> wrote: >>>> >>>>> I think it's impossible to do this within one M/R. You will want to >>>>> implement it in two M/R in Pig, because you have to calculate the >>>>> percentile >>>>> in pass 1, and then perform the filter in pass 2. >>>>> >>>>> >>>>> On Tue, Jun 29, 2010 at 8:14 AM, Dave Viner <[email protected]> >>> wrote: >>>>> >>>>>> Is there a UDF for generating the top X % of results? For example, >>> in >>>> a >>>>>> log >>>>>> parsing context, it might be the set of search queries that >>> represent >>>> the >>>>>> top 80% of all queries. >>>>>> >>>>>> I see in the piggybank that there is a TOP function, but that only >>>> takes >>>>>> the >>>>>> top *number* of results, rather a percentile. >>>>>> >>>>>> Thanks >>>>>> Dave Viner >>>>>> >>>>> >>>> >>> >> >>
