Hi Folks,
I have a case where I need to do a top-K on nested fields
in my tuples. For example, consider the following tuples (format is [url,
query]):
(abc.com, A)
(abc.com, A)
(abc.com, C)
(abc.com, B)
(xyz.com, D)
(xyz.com, D)
(xyz.com, E)
I need to be able to group by URL and output the top-K queries along with
their counts for each URL. So the output would be
abc.com A 2
abc.com B 1
abc.com C 1
xyz.com D 2
xyz.com E 1
In my understanding we would do something like this (top is a
hypothetical UDF here):
url = GROUP tuples BY url;
result = FOREACH url GENERATE group, top(10, tuples.query);
Is there a UDF to do this? If not, I can write one and possibly
contribute it.
Is there any other way of doing it?
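One alternative I can think of without a UDF is to count each (url, query)
pair first and then sort and limit inside a nested FOREACH. This is only an
untested sketch: it assumes a Pig version that allows ORDER and LIMIT inside
a nested FOREACH block, and the input file name 'queries.txt' and the aliases
are made up for illustration.

```pig
-- Hypothetical input: tab-separated (url, query) records.
raw = LOAD 'queries.txt' AS (url:chararray, query:chararray);

-- Count how often each query occurs for each URL.
pairs  = GROUP raw BY (url, query);
counts = FOREACH pairs GENERATE FLATTEN(group) AS (url, query),
                                COUNT(raw) AS cnt;

-- Regroup by URL and keep the K (here 10) most frequent queries.
by_url = GROUP counts BY url;
result = FOREACH by_url {
    sorted = ORDER counts BY cnt DESC;
    top_k  = LIMIT sorted 10;
    GENERATE FLATTEN(top_k);
};

DUMP result;
```

This needs two GROUP steps, so I am not sure how it would compare with a
dedicated UDF in terms of performance. Ties in the count (B and C above) would
come out in no guaranteed order.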
Thanks
-Ankur