Let me try to rephrase my question. I have a set of tuples of the form (field1, field2). I need to group by 'field1', then sub-group by 'field2', and output the top-K instances of field2 for each field1. What's the right way of doing that in Pig?
What I did was group my tuples by 'field1' and pass the resulting DataBag to my UDF, top(), which counts the occurrences of each tuple and outputs the top K. This worked, but it didn't look like the most efficient solution. Can anyone suggest something different?

Thanks
-Ankur

-----Original Message-----
From: Goel, Ankur [mailto:[email protected]]
Sent: Thursday, January 08, 2009 3:03 PM
To: [email protected]; [email protected]
Subject: Top-K for nested fields

Hi Folks,

I have a case wherein I need to do top-K on nested fields in my tuple. For example, consider the following tuples (format is [url, query]):

(abc.com, A)
(abc.com, A)
(abc.com, C)
(abc.com, B)
(xyz.com, D)
(xyz.com, D)
(xyz.com, E)

I need to be able to group by URL and output the top-K queries, along with their counts, for each URL. So the output would be:

abc.com A 2
abc.com B 1
abc.com C 1

In my understanding we would do something like:

    grouped = GROUP tuples BY url;
    result = FOREACH grouped GENERATE group, top(10, tuples.query);

Is there a UDF to do this? If not, then I can write one and possibly contribute it. Is there any other way of doing it?

Thanks
-Ankur
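For reference, here is a minimal plain-Python sketch of the logic discussed in this thread: group by the first field, count each value of the second field, and keep the K most frequent per group. It is not Pig and not the actual top() UDF. The function name top_k_per_key and the tie-breaking rule (equal counts ordered by value) are assumptions made only for illustration.

```python
from collections import Counter, defaultdict


def top_k_per_key(pairs, k):
    """Return {field1: [(field2, count), ...]} with at most k entries per field1.

    Hypothetical helper: entries are ordered by descending count, and ties
    are broken by the field2 value so the result is deterministic.
    """
    counts = defaultdict(Counter)
    for field1, field2 in pairs:
        # One counter per field1 value; each counter tallies field2 values.
        counts[field1][field2] += 1

    result = {}
    for field1, counter in counts.items():
        ranked = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
        result[field1] = ranked[:k]
    return result


# The example tuples from the original message, in (url, query) form.
tuples = [
    ("abc.com", "A"), ("abc.com", "A"), ("abc.com", "C"), ("abc.com", "B"),
    ("xyz.com", "D"), ("xyz.com", "D"), ("xyz.com", "E"),
]

top = top_k_per_key(tuples, 10)
for url, ranked in top.items():
    for query, count in ranked:
        print(url, query, count)
```

The sketch counts (url, query) pairs first and ranks afterwards, so only one small counter per URL has to be sorted rather than the full bag of raw tuples.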
