Let me try to rephrase my question.
I have a set of tuples of the form (field1, field2). I need to group by
'field1', then sub-group by 'field2', and output the top-K values of
field2 (by count) for each field1. What's the right way of doing that in Pig?

What I did was group my tuples by 'field1' and pass the resulting DataBag
to my UDF, top(), which just counts the occurrences of each tuple and
outputs the top K.
This worked, but it didn't look like the most efficient solution.

Can anyone suggest something different?

Thanks
-Ankur

-----Original Message-----
From: Goel, Ankur [mailto:[email protected]] 
Sent: Thursday, January 08, 2009 3:03 PM
To: [email protected]; [email protected]
Subject: Top-K for nested fields

Hi Folks,

I have a case where I need to do top-K on nested fields in my tuples.
For example, consider the following tuples (format is [url, query]):

(abc.com, A)
(abc.com, A)
(abc.com, C)
(abc.com, B)
(xyz.com, D)
(xyz.com, D)
(xyz.com, E)

 

I need to be able to group by URL and output the top-K queries, along with
their counts, for each URL. So the output for abc.com would be:

abc.com A 2
abc.com B 1
abc.com C 1
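
To make the expected result concrete, here is a minimal sketch in plain Python (not Pig, and not the UDF itself) of the grouping and counting described above. The function name top_k_per_key and the alphabetical tie-break between equally frequent queries are assumptions made only for illustration.

```python
from collections import Counter, defaultdict


def top_k_per_key(pairs, k):
    """Return {key: [(value, count), ...]} with at most k values per key."""
    # One Counter per key: this is the "group by url, then count queries" step.
    counts = defaultdict(Counter)
    for key, value in pairs:
        counts[key][value] += 1

    result = {}
    for key, counter in counts.items():
        # Sort by descending count; ties are broken alphabetically
        # (an assumption, the original message does not specify tie order).
        ranked = sorted(counter.items(), key=lambda item: (-item[1], item[0]))
        result[key] = ranked[:k]
    return result


# Sample data from the message, in [url, query] format.
tuples = [
    ("abc.com", "A"), ("abc.com", "A"), ("abc.com", "C"), ("abc.com", "B"),
    ("xyz.com", "D"), ("xyz.com", "D"), ("xyz.com", "E"),
]

for url, ranked in top_k_per_key(tuples, 10).items():
    for query, count in ranked:
        print(url, query, count)
```

With K = 10 this prints the three abc.com rows (A 2, B 1, C 1) followed by xyz.com D 2 and xyz.com E 1.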

 

 

As I understand it, we would do something like:

grouped = GROUP tuples BY url;
result = FOREACH grouped GENERATE group, top(10, tuples.query);

 

Is there a UDF to do this? If not, I can write one and possibly
contribute it.

 

Is there any other way of doing it?

 

Thanks

-Ankur
