Hi Thejas,

Thanks for this suggestion. Just to make sure I understand the strategy here:
Pig query 1 outputs 2 files (or directories with files in them): "file.ordered",
which contains the data set (query, count) in ascending order of count, and
"total_rows", which contains 1 line with the total number of rows.

Then I need a small script that reads the value of "total_rows", multiplies that
value by the desired percentile (0.95 in the example), and invokes Pig a second
time, passing the resulting "TOP_ROWS" value as a parameter on the command line.
The second Pig query writes the desired rows into "output", which contains only
those rows whose count is above the given threshold.

Is that correct?

Thanks
Dave Viner

On Tue, Jun 29, 2010 at 4:11 PM, Thejas Nair <[email protected]> wrote:

> This pig query will not give you the "top xx% of results". I.e., if xx = 50%
> and results = 200, you may not get 100 rows in output.
> If you really want "top xx% of results", you need to do something like the
> following. (If you just want rows that have
> count > min(count) + (max(count) - min(count))*0.95, you can store and use
> the value of V the way TOTAL_ROWS is stored in this example.)
>
> Pig query 1
> -----------
> L = load 'file' as (query, count); -- assuming count is already there, for simplicity
> O = order L by count;              -- order by the column that defines 'top'
> G = group L all;
> TOTAL_ROWS = foreach G generate COUNT(L);
> store O into 'file.ordered';
>
> -- At present, there is no way to use TOTAL_ROWS directly to filter the results.
> store TOTAL_ROWS into 'total_rows.txt';
>
> Pig query 2
> -----------
> -- Assuming this query is invoked with -param TOP_ROWS=X*0.95, where X is
> -- the result stored in total_rows.txt:
> L = load 'file.ordered';
> LIM = limit L $TOP_ROWS;
> store LIM into 'output';
>
> On 6/29/10 3:24 PM, "Dave Viner" <[email protected]> wrote:
>
>> Actually, I've gotten the first half of the code to work now.
>> Here's how it looks:
>>
>> X = LOAD 'samples/excite-small.log' USING PigStorage('\t') AS
>>     (user:chararray, time:long, query:chararray);
>> Z = FILTER X BY query is not null;
>> A = GROUP Z BY query;
>> B = FOREACH A GENERATE group as query:chararray, COUNT(Z) as count:long;
>> C = GROUP B ALL;
>> U = FOREACH C GENERATE MIN(B.count) as min:long, MAX(B.count) as max:long;
>> V = FOREACH U GENERATE min + (max - min) * 0.95;
>>
>> V is appropriately set... but I can't perform the final step of actually
>> filtering the values by that count.
>>
>> The simple approach:
>>
>> TOP = FILTER B BY count > V;
>>
>> doesn't work:
>>
>> ERROR 1000: Error during parsing. Invalid alias: V in
>> {query: chararray,count: long}
>>
>> Same with SPLIT.
>>
>> How do I filter or split on the value of V?
>>
>> Dave Viner
>>
>> On Tue, Jun 29, 2010 at 3:00 PM, Dave Viner <[email protected]> wrote:
>>
>>> I don't quite understand this pig latin. The piggybank function
>>> org.apache.pig.piggybank.evaluation.math.MIN takes 2 parameters which are
>>> compared. Here's the sample I'm trying, using the tutorial excite log as
>>> a sample:
>>>
>>> X = LOAD 'samples/excite-small.log' USING PigStorage('\t') AS
>>>     (user:chararray, time:long, query:chararray);
>>> Z = FILTER X BY query is not null;
>>> A = GROUP Z BY query;
>>> B = FOREACH A GENERATE group as query:chararray, COUNT(Z) as count:long;
>>> U = FOREACH B GENERATE *,
>>>     MIN(count) as min:long,
>>>     MAX(count) as max:long;
>>>
>>> This doesn't seem to work at all. It dies with this error:
>>>
>>> ERROR 1022: Type mismatch merging schema prefix. Field Schema: double.
>>> Other Field Schema: min: long
>>>
>>> Changing the min:long and max:long to doubles (as suggested by the error
>>> message) causes this error:
>>>
>>> ERROR 1045: Could not infer the matching function for
>>> org.apache.pig.builtin.MIN as multiple or none of them fit. Please use an
>>> explicit cast.
>>> What am I missing in using the sample code you've provided? I can't seem
>>> to get it to work...
>>>
>>> Thanks for your help.
>>> Dave Viner
>>>
>>> On Tue, Jun 29, 2010 at 10:17 AM, hc busy <[email protected]> wrote:
>>>
>>>> That's what I tried to say in my last email. I don't believe you can
>>>> calculate the percentiles exactly in just one pass. Writing out the Pig
>>>> for the two-pass algorithm should be easy enough:
>>>>
>>>> P = group TABLE all;
>>>> U = foreach P generate MIN(x) as min, MAX(x) as max;
>>>> V = foreach U generate min + (max-min)*0.95;
>>>>
>>>> That would give you the 95th percentile cutoff, and you just filter or
>>>> split by V.
>>>>
>>>> On Tue, Jun 29, 2010 at 10:03 AM, Dave Viner <[email protected]> wrote:
>>>>
>>>>> How would I calculate the percentile in one pass? In order to calculate
>>>>> the percentile for each item, I need to know the total count. How do I
>>>>> get the total count, and then calculate each item's percentile, in one
>>>>> pass?
>>>>>
>>>>> I don't mind doing multiple passes - I am just not sure how to make the
>>>>> calculation.
>>>>>
>>>>> Thanks
>>>>> Dave Viner
>>>>>
>>>>> On Tue, Jun 29, 2010 at 9:59 AM, hc busy <[email protected]> wrote:
>>>>>
>>>>>> I think it's impossible to do this within one M/R job. You will want
>>>>>> to implement it in two M/R jobs in Pig, because you have to calculate
>>>>>> the percentile in pass 1 and then perform the filter in pass 2.
>>>>>>
>>>>>> On Tue, Jun 29, 2010 at 8:14 AM, Dave Viner <[email protected]> wrote:
>>>>>>
>>>>>>> Is there a UDF for generating the top X% of results? For example, in
>>>>>>> a log parsing context, it might be the set of search queries that
>>>>>>> represent the top 80% of all queries.
>>>>>>> I see in the piggybank that there is a TOP function, but that only
>>>>>>> takes the top *number* of results, rather than a percentile.
>>>>>>>
>>>>>>> Thanks
>>>>>>> Dave Viner
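The driver step Dave describes at the top of the thread - read the row count
written by Pig query 1, multiply by the desired percentile, and re-invoke Pig
with the result as a parameter - can be sketched as follows. This is an
illustrative sketch, not code from the thread: the part-file path, the script
name `query2.pig`, and the function names are all assumptions.

```python
import subprocess

def compute_top_rows(total_rows: int, pct: float) -> int:
    """Number of rows that make up the given fraction of the ordered data set."""
    return int(total_rows * pct)

def run_second_pass(total_rows_file: str, pct: float = 0.95) -> None:
    # Read the single row count written by Pig query 1. Pig stores output as
    # a directory of part files; the exact file name inside 'total_rows.txt'
    # (e.g. part-00000) is an assumption here.
    with open(total_rows_file) as f:
        total = int(f.read().strip())
    top_rows = compute_top_rows(total, pct)
    # Re-invoke Pig, passing the limit on the command line, e.g.:
    #   pig -param TOP_ROWS=190 query2.pig
    subprocess.run(["pig", "-param", "TOP_ROWS=%d" % top_rows, "query2.pig"],
                   check=True)
```

With 200 total rows and pct = 0.95, `compute_top_rows` returns 190, which is
the value `$TOP_ROWS` would take in the second query's LIMIT.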

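Thejas's distinction between a value cutoff at min + (max - min)*0.95 and the
actual top xx% of rows is easy to see with a small example outside Pig. The
sketch below is illustrative only (the function names are mine, not from the
thread); it shows the two definitions disagreeing on a skewed data set.

```python
def range_cutoff(counts, pct=0.95):
    """Value threshold min + (max - min) * pct, like alias V in the thread."""
    lo, hi = min(counts), max(counts)
    return lo + (hi - lo) * pct

def first_fraction(counts, pct=0.95):
    """The first pct of rows in ascending order, like ORDER followed by LIMIT."""
    ordered = sorted(counts)
    return ordered[: int(len(ordered) * pct)]

counts = [1, 2, 3, 4, 100]

# Value-based: the threshold is 1 + (100 - 1) * 0.95 = 95.05, so only the
# single outlier row exceeds it.
above = [c for c in counts if c > range_cutoff(counts)]

# Rank-based: 95% of 5 rows truncates to 4 rows, regardless of how skewed
# the values are.
kept = first_fraction(counts)
```

Filtering by the value cutoff keeps only rows near the maximum, while LIMIT
after ORDER keeps a fixed share of the rows - which is why Thejas notes that
the cutoff-based query "will not give you the top xx% of results".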