OK, I think I see how it's working. As a side note to this conversation, the wiki at http://wiki.apache.org/pig/PigStreamingFunctionalSpec indicates that what I was looking for *should* be possible with this code:
A = load 'data1';
B = group A all;
C = foreach B generate COUNT(B);
store C into 'count';
D = load 'data2';
E = foreach D generate $1/GetScalar(C);

However, GetScalar is no longer present in 0.7.0.

Thanks for your help.
Dave Viner

On Tue, Jun 29, 2010 at 5:18 PM, Thejas Nair <[email protected]> wrote:

> On 6/29/10 4:22 PM, "Dave Viner" <[email protected]> wrote:
>
> > Hi Thejas,
> >
> > Thanks for this suggestion. Just to make sure I understand the strategy here:
> >
> > Pig query 1 outputs 2 files (or directories with files in them): "file.ordered", which contains the data set (query, count) in ascending order of count; and "total_rows", which contains 1 line with the total number of rows.
> >
> > Then, I need a small script that reads the value of "total_rows", multiplies that value by the desired percentile (0.95 in the example), and invokes pig a second time, passing the "TOP_ROWS" value as a define on the command line. The second pig query outputs the desired file in "output", which contains only those rows which have a count above the given threshold.
>
> Yes, you are right. But "output" contains 0.95 * total_num_rows rows; it is not all rows that have a count above a given threshold. So it is possible that a few rows that are not in "output" have the same count value as the last result in "output". For example, if the input has 100 rows with 1 as the count value, you will get an arbitrary 95 of those rows in the output.
>
> -Thejas
>
> > Is that correct?
> >
> > Thanks
> > Dave Viner
> >
> > On Tue, Jun 29, 2010 at 4:11 PM, Thejas Nair <[email protected]> wrote:
> >
> >> This pig query will not give you the "top xx% of results". I.e., if xx = 50% and results = 200, you may not get 100 rows in output.
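The tie behavior Thejas describes can be sketched in plain Python (this example is not part of the thread; the 100-rows-of-count-1 data is invented to match his description):

```python
# 100 queries that all have count 1, as in Thejas's example.
rows = [("q%d" % i, 1) for i in range(100)]

# LIMIT-style selection: keep 0.95 * total_rows rows. Ties are broken
# arbitrarily, so which 5 rows fall out is not determined by the data.
top_n = len(rows) * 95 // 100
limited = sorted(rows, key=lambda r: r[1])[:top_n]

print(len(limited))   # 95 rows kept; the 5 excluded rows have the same
                      # count (1) as every row that was kept
```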
> >> If you really want "top xx% of results", you need to do something like the following -
> >> (if you just want rows that have count < (max(count) - min(count))*0.95, you can store and use the value of V the way TOTAL_ROWS is stored in this example)
> >>
> >> Pig query 1
> >> -----------
> >> L = load 'file' as (query, count); -- assuming that count is already there, for simplicity
> >> O = order L by count; -- order it by the column that defines 'top'
> >> G = group L all;
> >> TOTAL_ROWS = foreach G generate COUNT(L);
> >> store O into 'file.ordered';
> >>
> >> -- at present, there is no way to use 'TOTAL_ROWS' directly to filter the results
> >> store TOTAL_ROWS into 'total_rows.txt';
> >>
> >> Pig query 2
> >> -----------
> >> -- assuming this query is invoked with -param TOP_ROWS=X*0.95, where X is the result stored in total_rows.txt
> >> L = load 'file.ordered';
> >> LIM = limit L $TOP_ROWS;
> >> store LIM into 'output';
> >>
> >> On 6/29/10 3:24 PM, "Dave Viner" <[email protected]> wrote:
> >>
> >>> Actually, I've gotten the first half of the code to work now. Here's how it looks:
> >>>
> >>> X = LOAD 'samples/excite-small.log' USING PigStorage('\t') AS (user:chararray, time:long, query:chararray);
> >>> Z = FILTER X BY query is not null;
> >>> A = GROUP Z BY query;
> >>> B = FOREACH A GENERATE group AS query:chararray, COUNT(Z) AS count:long;
> >>> C = GROUP B ALL;
> >>> U = FOREACH C GENERATE MIN(B.count) AS min:long, MAX(B.count) AS max:long;
> >>> V = FOREACH U GENERATE min + (max-min)*0.95;
> >>>
> >>> V is appropriately set... but I can't perform the final step of actually filtering the values by that count.
> >>>
> >>> The simple approach:
> >>>
> >>> TOP = FILTER B BY count > V;
> >>>
> >>> doesn't work:
> >>>
> >>> ERROR 1000: Error during parsing. Invalid alias: V in {query: chararray,count: long}
> >>>
> >>> Same with SPLIT.
> >>>
> >>> How do I filter or split on the value of V?
> >>>
> >>> Dave Viner
> >>>
> >>> On Tue, Jun 29, 2010 at 3:00 PM, Dave Viner <[email protected]> wrote:
> >>>
> >>>> I don't quite understand this pig latin. The piggybank function org.apache.pig.piggybank.evaluation.math.MIN takes 2 parameters, which are compared. Here's the sample I'm trying, using the tutorial excite log as a sample.
> >>>>
> >>>> X = LOAD 'samples/excite-small.log' USING PigStorage('\t') AS (user:chararray, time:long, query:chararray);
> >>>> Z = FILTER X BY query is not null;
> >>>> A = GROUP Z BY query;
> >>>> B = FOREACH A GENERATE group AS query:chararray, COUNT(Z) AS count:long;
> >>>> U = FOREACH B GENERATE *,
> >>>>     MIN(count) AS min:long,
> >>>>     MAX(count) AS max:long;
> >>>>
> >>>> This doesn't seem to work at all. It dies with this error:
> >>>>
> >>>> ERROR 1022: Type mismatch merging schema prefix. Field Schema: double. Other Field Schema: min: long
> >>>>
> >>>> Changing min:long and max:long to doubles (as suggested by the error message) causes this error:
> >>>>
> >>>> ERROR 1045: Could not infer the matching function for org.apache.pig.builtin.MIN as multiple or none of them fit. Please use an explicit cast.
> >>>>
> >>>> What am I missing in using the sample code you've provided? I can't seem to get it to work...
> >>>>
> >>>> Thanks for your help.
> >>>> Dave Viner
> >>>>
> >>>> On Tue, Jun 29, 2010 at 10:17 AM, hc busy <[email protected]> wrote:
> >>>>
> >>>>> That's what I tried to say in my last email. I don't believe you can calculate the percentiles exactly in just one pass. Writing out the Pig for the two-pass algorithm should be easy enough..
> >>>>>
> >>>>> P = group TABLE all;
> >>>>> U = foreach P generate MIN(TABLE.x) as min, MAX(TABLE.x) as max;
> >>>>> V = foreach U generate min + (max-min)*0.95;
> >>>>>
> >>>>> would give you the 95th percentile cutoff, and you just filter or split by V.
> >>>>>
> >>>>> On Tue, Jun 29, 2010 at 10:03 AM, Dave Viner <[email protected]> wrote:
> >>>>>
> >>>>>> How would I calculate the percentile in one pass? In order to calculate the percentile for each item, I need to know the total count. How do I get the total count, and then calculate each item's percentile, in one pass?
> >>>>>>
> >>>>>> I don't mind doing multiple passes - I am just not sure how to make the calculation.
> >>>>>>
> >>>>>> Thanks
> >>>>>> Dave Viner
> >>>>>>
> >>>>>> On Tue, Jun 29, 2010 at 9:59 AM, hc busy <[email protected]> wrote:
> >>>>>>
> >>>>>>> I think it's impossible to do this within one M/R pass. You will want to implement it as two M/R jobs in Pig, because you have to calculate the percentile cutoff in pass 1 and then perform the filter in pass 2.
> >>>>>>>
> >>>>>>> On Tue, Jun 29, 2010 at 8:14 AM, Dave Viner <[email protected]> wrote:
> >>>>>>>
> >>>>>>>> Is there a UDF for generating the top X% of results? For example, in a log-parsing context, it might be the set of search queries that represent the top 80% of all queries.
> >>>>>>>>
> >>>>>>>> I see in the piggybank that there is a TOP function, but that only takes the top *number* of results, rather than a percentile.
> >>>>>>>>
> >>>>>>>> Thanks
> >>>>>>>> Dave Viner
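hc busy's two-pass recipe above can be sketched end to end in plain Python (illustrative only; the query counts are invented, and note that min + (max-min)*0.95 is a cutoff over the range of values, not a rank-based percentile):

```python
# Pass 1: compute the cutoff V = min + (max - min) * 0.95 over the counts.
counts = {"query_a": 3, "query_b": 40, "query_c": 97,
          "query_d": 99, "query_e": 100}

lo, hi = min(counts.values()), max(counts.values())
cutoff = lo + (hi - lo) * 0.95        # 3 + 97 * 0.95 = 95.15

# Pass 2: filter (or split) by the cutoff, as in 'TOP = FILTER B BY count > V'.
top = {q: c for q, c in counts.items() if c > cutoff}

print(sorted(top))                    # ['query_c', 'query_d', 'query_e']
```

Because this keeps every row above a value cutoff, it can return more or fewer rows than 5% of the input, which is the distinction Thejas draws at the top of the thread.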

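For contrast, the original question at the bottom of the thread ("queries that represent the top 80% of all queries") reads most naturally as a cumulative-share cutoff rather than a value cutoff. A plain-Python sketch, with invented counts:

```python
# Take queries in descending count order until they cover 80% of total volume.
counts = {"a": 50, "b": 30, "c": 10, "d": 6, "e": 4}
total = sum(counts.values())          # 100 events in all
threshold = total * 80 // 100         # stop once 80% of volume is covered

picked, running = [], 0
for q, c in sorted(counts.items(), key=lambda kv: -kv[1]):
    if running >= threshold:
        break
    picked.append(q)
    running += c

print(picked, running)                # ['a', 'b'] 80
```

This gives a different answer from both the LIMIT-based and the range-cutoff approaches discussed in the thread, which is much of what the discussion is circling around.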