Ahh, I see, I seem to have misread the question. If it's the top xx% of entries, then certainly sorting and then limiting is the way to go.
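That sort-then-limit idea can be sketched like this (Python used purely for illustration; in Pig it would be an ORDER followed by a LIMIT, and the descending order here is just one choice of what "top" means):

```python
# Illustrative only: "top xx% of entries" as sort-then-limit.
# In Pig this corresponds to ORDER ... BY count; LIMIT ... N,
# where N = total_rows * xx / 100.

def top_percent(rows, pct):
    """Return the top pct% of (query, count) rows, highest counts first."""
    ordered = sorted(rows, key=lambda r: r[1], reverse=True)
    n = int(len(ordered) * pct / 100)
    return ordered[:n]

rows = [("a", 10), ("b", 40), ("c", 30), ("d", 20)]
top_percent(rows, 50)  # the two highest-count rows: [("b", 40), ("c", 30)]
```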
@Thejas I had thought that LIMIT is distributed and does not guarantee you get the results in order?

@Dave, here's what to do after you get the min + (max-min)*0.95 into V, which now has just one row:

P = group TABLE all;
U = foreach P generate MIN(x) as min, MAX(x) as max;
V = foreach U generate min + (max-min)*0.95 as v;
D = cross TABLE, V;
E = filter D by D.x <= V.v;

We should really start a wiki page to store these common calculations in Pig... I'd imagine a lot of people out there have needed to do this at one time or another. This is also why, imnsho, I'd like to see cross as a nested_op inside foreach, because of this type of application.

On Tue, Jun 29, 2010 at 6:42 PM, Dave Viner <[email protected]> wrote:

> Ok. I think I see how it's working.
>
> As a side note to this conversation, the wiki at
> http://wiki.apache.org/pig/PigStreamingFunctionalSpec indicates that what I
> was looking for *should* be possible with this code:
>
> A = load 'data1';
> B = group A all;
> C = foreach B generate COUNT(B);
> store C into 'count';
> D = load 'data2';
> E = foreach D generate $1/GetScalar(C);
>
> However, GetScalar is no longer present in 0.7.0.
>
> Thanks for your help.
>
> Dave Viner
>
> On Tue, Jun 29, 2010 at 5:18 PM, Thejas Nair <[email protected]> wrote:
>
> > On 6/29/10 4:22 PM, "Dave Viner" <[email protected]> wrote:
> >
> > > Hi Thejas,
> > >
> > > Thanks for this suggestion. Just to make sure I understand the
> > > strategy here:
> > >
> > > Pig query 1 outputs 2 files (or directories w/ files in them):
> > > "file.ordered", which contains the data set (query, count) in
> > > ascending order of count; and "total_rows", which contains 1 line
> > > with the total number of rows.
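For the two-query flow being summarized here, the glue between the two Pig invocations can be a small driver script. This is a hypothetical sketch: the part-file path, the 0.95 default, and the script name "query2.pig" are all assumptions, not from the thread.

```python
# Hypothetical driver: read the row count stored by Pig query 1,
# compute TOP_ROWS, and invoke Pig query 2 with it as a -param.
import subprocess

def compute_top_rows(total_rows, fraction=0.95):
    """TOP_ROWS = total number of rows * desired fraction."""
    return int(total_rows * fraction)

def run_query2(total_rows_file, pig_script="query2.pig", fraction=0.95):
    with open(total_rows_file) as f:
        total = int(f.read().strip())
    top_rows = compute_top_rows(total, fraction)
    # Equivalent to: pig -param TOP_ROWS=<n> query2.pig
    subprocess.check_call(
        ["pig", "-param", "TOP_ROWS=%d" % top_rows, pig_script])
```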
> > > Then, I need a small script that reads the value of "total_rows",
> > > multiplies that value by the desired percentile (0.95 in the
> > > example), and invokes pig a second time, passing the "TOP_ROWS"
> > > value as a define on the command line. The second pig query outputs
> > > the desired file in "output", which contains only those rows which
> > > have a count above the given threshold.
> >
> > Yes, you are right. But the "output" contains 0.95 * total_num_rows
> > rows; it is not all rows that have a count above a given threshold.
> > So it is possible that a few rows that are not in the output have the
> > same count as the last result in "output". For example, if the input
> > has 100 rows with 1 as the count value, you will get any 95 of those
> > rows in the output.
> >
> > -Thejas
> >
> > > Is that correct?
> > >
> > > Thanks
> > > Dave Viner
> > >
> > > On Tue, Jun 29, 2010 at 4:11 PM, Thejas Nair <[email protected]> wrote:
> > >
> > >> This pig query will not give you the "top xx% of results". I.e., if
> > >> xx = 50% and results = 200, you may not get 100 rows in the output.
> > >> If you really want the "top xx% of results", you need to do
> > >> something like the following.
> > >> (If you just want rows that have count < min(count) + (max(count) -
> > >> min(count))*0.95, you can store and use the value of V the way
> > >> TOTAL_ROWS is stored in this example.)
> > >>
> > >> Pig query 1
> > >> -----------
> > >> L = load 'file' as (query, count); -- assuming that count is
> > >> already there, for simplicity
> > >> O = order L by count; -- order it by the column that defines 'top'
> > >> G = group L all;
> > >> TOTAL_ROWS = foreach G generate COUNT(L);
> > >> store O into 'file.ordered';
> > >>
> > >> -- at present, there is no way to use TOTAL_ROWS directly to filter
> > >> the results
> > >> store TOTAL_ROWS into 'total_rows.txt';
> > >>
> > >> Pig query 2
> > >> -----------
> > >> -- assuming this query is invoked with -param TOP_ROWS=X*0.95,
> > >> where X is the result stored in total_rows.txt
> > >> L = load 'file.ordered';
> > >> LIM = limit L $TOP_ROWS;
> > >> store LIM into 'output';
> > >>
> > >> On 6/29/10 3:24 PM, "Dave Viner" <[email protected]> wrote:
> > >>
> > >>> Actually, I've gotten the first half of the code to work now.
> > >>> Here's how it looks:
> > >>>
> > >>> X = LOAD 'samples/excite-small.log' USING PigStorage('\t') AS
> > >>> (user:chararray, time:long, query:chararray);
> > >>> Z = FILTER X BY query is not null;
> > >>> A = GROUP Z BY query;
> > >>> B = FOREACH A GENERATE group as query:chararray, COUNT(Z) as
> > >>> count:long;
> > >>> C = GROUP B ALL;
> > >>> U = FOREACH C GENERATE MIN(B.count) as min:long, MAX(B.count) as
> > >>> max:long;
> > >>> V = FOREACH U GENERATE min + (max-min)*0.95;
> > >>>
> > >>> V is appropriately set... but I can't perform the final step of
> > >>> actually filtering the values by that count.
> > >>>
> > >>> The simple approach:
> > >>> TOP = FILTER B BY count > V;
> > >>>
> > >>> doesn't work...
> > >>> ERROR 1000: Error during parsing. Invalid alias: V in
> > >>> {query: chararray,count: long}
> > >>>
> > >>> Same with SPLIT.
> > >>>
> > >>> How do I filter or split on the value of V?
> > >>>
> > >>> Dave Viner
> > >>>
> > >>> On Tue, Jun 29, 2010 at 3:00 PM, Dave Viner <[email protected]> wrote:
> > >>>
> > >>>> I don't quite understand this pig latin. The piggybank function
> > >>>> org.apache.pig.piggybank.evaluation.math.MIN takes 2 parameters
> > >>>> which are compared. Here's the sample I'm trying, using the
> > >>>> tutorial excite log as a sample.
> > >>>>
> > >>>> X = LOAD 'samples/excite-small.log' USING PigStorage('\t') AS
> > >>>> (user:chararray, time:long, query:chararray);
> > >>>> Z = FILTER X BY query is not null;
> > >>>> A = GROUP Z BY query;
> > >>>> B = FOREACH A GENERATE group as query:chararray, COUNT(Z) as
> > >>>> count:long;
> > >>>> U = FOREACH B GENERATE *,
> > >>>> MIN(count) as min:long,
> > >>>> MAX(count) as max:long;
> > >>>>
> > >>>> This doesn't seem to work at all. It dies with this error:
> > >>>> ERROR 1022: Type mismatch merging schema prefix. Field Schema:
> > >>>> double. Other Field Schema: min: long
> > >>>>
> > >>>> Changing the min:long and max:long to doubles (as suggested by
> > >>>> the error message) causes this error:
> > >>>> ERROR 1045: Could not infer the matching function for
> > >>>> org.apache.pig.builtin.MIN as multiple or none of them fit.
> > >>>> Please use an explicit cast.
> > >>>>
> > >>>> What am I missing in using the sample code you've provided? I
> > >>>> can't seem to get it to work...
> > >>>>
> > >>>> Thanks for your help.
> > >>>> Dave Viner
> > >>>>
> > >>>> On Tue, Jun 29, 2010 at 10:17 AM, hc busy <[email protected]> wrote:
> > >>>>
> > >>>>> That's what I tried to say in my last email. I don't believe you
> > >>>>> can calculate exactly the percentiles in just one pass.
> > >>>>> Writing out the pig for the two-pass algorithm should be easy
> > >>>>> enough:
> > >>>>>
> > >>>>> P = group TABLE all;
> > >>>>> U = foreach P generate MIN(x) as min, MAX(x) as max;
> > >>>>> V = foreach U generate min + (max-min)*0.95;
> > >>>>>
> > >>>>> would give you the 95th percentile cutoff, and you just filter
> > >>>>> or split by V.
> > >>>>>
> > >>>>> On Tue, Jun 29, 2010 at 10:03 AM, Dave Viner <[email protected]> wrote:
> > >>>>>
> > >>>>>> How would I calculate the percentile in one pass? In order to
> > >>>>>> calculate the percentile for each item, I need to know the
> > >>>>>> total count. How do I get the total count, and then calculate
> > >>>>>> each item's percentile in one pass?
> > >>>>>>
> > >>>>>> I don't mind doing multiple passes - I am just not sure how to
> > >>>>>> make the calculation.
> > >>>>>>
> > >>>>>> Thanks
> > >>>>>> Dave Viner
> > >>>>>>
> > >>>>>> On Tue, Jun 29, 2010 at 9:59 AM, hc busy <[email protected]> wrote:
> > >>>>>>
> > >>>>>>> I think it's impossible to do this within one M/R job. You
> > >>>>>>> will want to implement it in two M/R jobs in Pig, because you
> > >>>>>>> have to calculate the percentile in pass 1, and then perform
> > >>>>>>> the filter in pass 2.
> > >>>>>>>
> > >>>>>>> On Tue, Jun 29, 2010 at 8:14 AM, Dave Viner <[email protected]> wrote:
> > >>>>>>>
> > >>>>>>>> Is there a UDF for generating the top X % of results? For
> > >>>>>>>> example, in a log parsing context, it might be the set of
> > >>>>>>>> search queries that represent the top 80% of all queries.
> > >>>>>>>>
> > >>>>>>>> I see in the piggybank that there is a TOP function, but that
> > >>>>>>>> only takes the top *number* of results, rather than a
> > >>>>>>>> percentile.
> > >>>>>>>>
> > >>>>>>>> Thanks
> > >>>>>>>> Dave Viner
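To make the distinction argued in this thread concrete: min + (max-min)*0.95 is a cutoff on *values*, not a count of rows, so on skewed data it can keep far more (or fewer) than 95% of the rows. A quick sketch (Python, purely illustrative):

```python
# Purely illustrative: the value-cutoff approach from the thread.

def value_cutoff(xs, frac=0.95):
    """Keep rows at or below min + (max - min) * frac (the single-row V)."""
    lo, hi = min(xs), max(xs)
    v = lo + (hi - lo) * frac
    return [x for x in xs if x <= v]

xs = [1, 1, 1, 1, 1, 1, 1, 1, 9, 10]
value_cutoff(xs, 0.5)   # cutoff is 5.5 -> keeps 8 of 10 rows (80%), not 50%
```

A true "top xx% of rows" needs the row count, hence the sort-and-limit (two-pass) approach discussed above.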
