I think you could turn that inside out: do the counting first by grouping on both fields, and then do the top-n by grouping on field1. I would cautiously expect that to be a bit faster.
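
To make the order of operations concrete, here is a rough sketch of that data flow in plain Python rather than Pig. The function name `top_k_per_key`, the tie-breaking rule, and the choice of k = 2 are assumptions made for illustration; the sample rows are the ones from the quoted message.

```python
from collections import Counter, defaultdict
import heapq


def top_k_per_key(pairs, k):
    """Return the k most frequent field2 values for each field1."""
    # Step 1: count every (field1, field2) pair, the equivalent of
    # grouping on both fields and counting each group.
    counts = Counter(pairs)

    # Step 2: regroup the already-counted pairs by field1 alone.
    by_field1 = defaultdict(list)
    for (field1, field2), n in counts.items():
        by_field1[field1].append((field2, n))

    # Step 3: keep the k entries with the highest count per field1.
    # Ties are broken alphabetically on field2 (an assumption; the
    # original question does not say how ties should be handled).
    return {
        field1: heapq.nsmallest(k, items, key=lambda item: (-item[1], item[0]))
        for field1, items in by_field1.items()
    }


rows = [
    ("abc.com", "A"), ("abc.com", "A"), ("abc.com", "C"), ("abc.com", "B"),
    ("xyz.com", "D"), ("xyz.com", "D"), ("xyz.com", "E"),
]
result = top_k_per_key(rows, 2)
```

The point of the ordering is that the second grouping only sees one small (field2, count) record per distinct pair instead of every raw tuple.
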

On Fri, Jan 9, 2009 at 4:11 AM, Goel, Ankur <[email protected]> wrote:

> Let me try and rephrase my question.
> I have a set of tuples of the form (field1, field2). I need to group by
> 'field1' and then sub-group by 'field2' and output top-k instances of
> field2 for field1. What's the right way of doing that in pig?
>
> What I did was group my tuples by 'field1' and pass the DataBag to
> my UDF - top() which just counts the occurrence of each tuple and
> outputs top-K.
> This worked but it didn't look like the most efficient solution.
>
> Can anyone suggest something different?
>
> Thanks
> -Ankur
>
> -----Original Message-----
> From: Goel, Ankur [mailto:[email protected]]
> Sent: Thursday, January 08, 2009 3:03 PM
> To: [email protected]; [email protected]
> Subject: Top-K for nested fields
>
> Hi Folks,
>
> I have a case wherein I need to do top-K on nested fields
> in my tuple. For example, consider the following tuples (format is [url,
> query])
>
> (abc.com, A)
> (abc.com, A)
> (abc.com, C)
> (abc.com, B)
> (xyz.com, D)
> (xyz.com, D)
> (xyz.com, E)
>
> I need to be able to group by URL and output top-K queries along with
> their count for each URL. So output would be
>
> abc.com A 2
> abc.com B 1
> abc.com C 1
>
> In my understanding we would do something like
>
> url = GROUP tuples BY url;
> result = FOREACH url GENERATE group, top(10, query)
>
> Is there a UDF to do this? If not then I can write one and possibly
> contribute.
>
> Is there any other way of doing it?
>
> Thanks
> -Ankur

--
Ted Dunning, CTO
DeepDyve
4600 Bohannon Drive, Suite 220
Menlo Park, CA 94025
www.deepdyve.com
650-324-0110, ext. 738
858-414-0013 (m)
