I think you could turn that inside out: do the counting first by grouping
on both fields, and then do the top-n by grouping on field1.  I would
cautiously expect that to be a bit faster.
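
A rough, untested sketch of that ordering follows. The input path, the
tab-delimited layout, the relation names and K = 10 are all assumptions
made for illustration, and the nested LIMIT needs a Pig release that
accepts it inside FOREACH:

```pig
-- Assumed input: tab-separated (url, query) pairs; the path is hypothetical.
raw = LOAD 'queries.txt' USING PigStorage('\t')
      AS (url:chararray, query:chararray);

-- Step 1: count occurrences of each (url, query) pair.
pairs  = GROUP raw BY (url, query);
counts = FOREACH pairs GENERATE
             FLATTEN(group) AS (url, query),
             COUNT(raw)     AS cnt;

-- Step 2: regroup the much smaller counted relation by url alone.
by_url = GROUP counts BY url;

-- Step 3: within each url, sort by count and keep the first K rows (K = 10 here).
top_k = FOREACH by_url {
    ordered = ORDER counts BY cnt DESC;
    best    = LIMIT ordered 10;
    GENERATE FLATTEN(best);
};

-- Output path is hypothetical as well.
STORE top_k INTO 'top_queries';
```

Each output row is (url, query, cnt), and ties within a url come out in
no particular order. The point of counting first is that the second
grouping sees one row per distinct (url, query) pair rather than the full
bag of raw tuples, and the COUNT step should be able to use the combiner.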

On Fri, Jan 9, 2009 at 4:11 AM, Goel, Ankur <[email protected]> wrote:

> Let me try and rephrase my question.
> I have a set of tuples of the form (field1, field2). I need to group by
> 'field1' and then sub-group by 'field2' and output top-k instances of
> field2 for field1. What's the right way of doing that in Pig?
>
> What I did was group my tuples by 'field1' and pass the DataBag to
> my UDF, top(), which just counts the occurrences of each tuple and
> outputs the top K.
> This worked, but it didn't look like the most efficient solution.
>
> Can anyone suggest something different?
>
> Thanks
> -Ankur
>
> -----Original Message-----
> From: Goel, Ankur [mailto:[email protected]]
> Sent: Thursday, January 08, 2009 3:03 PM
> To: [email protected]; [email protected]
> Subject: Top-K for nested fields
>
> Hi Folks,
>
> I have a case wherein I need to do top-K on nested fields
> in my tuple. For example, consider the following tuples (format is [url,
> query]):
>
> (abc.com, A)
> (abc.com, A)
> (abc.com, C)
> (abc.com, B)
> (xyz.com, D)
> (xyz.com, D)
> (xyz.com, E)
>
> I need to be able to group by URL and output top-K queries along with
> their count for each URL. So the output for abc.com would be
>
> abc.com A 2
> abc.com B 1
> abc.com C 1
>
> In my understanding we would do something like
>
> url = GROUP tuples BY url;
> result = FOREACH url GENERATE group, top(10, tuples.query);
>
> Is there a UDF to do this? If not then I can write one and possibly
> contribute.
>
> Is there any other way of doing it?
>
> Thanks
> -Ankur
>


-- 
Ted Dunning, CTO
DeepDyve
4600 Bohannon Drive, Suite 220
Menlo Park, CA 94025
www.deepdyve.com
650-324-0110, ext. 738
858-414-0013 (m)
