Re: Top-K for nested fields

Alan Gates Tue, 13 Jan 2009 08:43:49 -0800

I would encourage you to open a JIRA. If people disagree with putting
limit in the nested foreach they can make their arguments against it
there. In general, our desire is to make Pig Latin fully nestable (so
any keyword could be in a foreach). Adding this feature should be very
simple, as limit is easy to implement. So if someone wanted to take
this on it should not be much work. I don't have time to implement and
test it, but I'm happy to provide guidance on the necessary changes to
anyone interested.


Alan.


On Jan 12, 2009, at 11:52 PM, Goel, Ankur wrote:

Rad,
The Pig types branch does have support for LIMIT, but not for nested
structures inside FOREACH. So as a workaround I implemented a top() UDF.
But I think it makes sense to have LIMIT support for nested structures
also.
We can open a JIRA for this if people agree.

Thanks
-Ankur

-----Original Message-----
From: rad gara [mailto:[email protected]]
Sent: Monday, January 12, 2009 5:55 PM
To: [email protected]
Subject: Re: Top-K for nested fields

Ankur, concerning your code below, a TakeFirst(bag, count) UDF can be
implemented.  So the desired line would be
topK = TakeFirst(ordered, 10);

But I guess the performance of a nested FOREACH statement may not be
very good when it processes large bags (right?).  It seems that Pig
support for LIMIT is necessary for limiting large relations.
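
On the performance point above, one possible way for a UDF to avoid
fully ordering a large bag is to keep only the best `count` rows in a
bounded heap while scanning. A minimal sketch of that idea in plain
Python (not Pig code); the name take_top and the (query, clicks) row
layout are assumptions made for illustration:

```python
import heapq

def take_top(rows, count, key):
    """Return the `count` rows with the largest key, largest first.

    heapq.nlargest keeps at most `count` candidates in memory at a time,
    so the full input is never sorted.
    """
    return heapq.nlargest(count, rows, key=key)

# Hypothetical (query, clicks) rows for a single url.
rows = [("A", 2), ("B", 1), ("C", 1), ("D", 5)]
best = take_top(rows, 2, key=lambda row: row[1])
```

With the rows above, `best` holds the two rows with the highest click
counts. This only helps inside a single bag; it does not change how
much data has to reach that bag in the first place.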

2009/1/12 Goel, Ankur <[email protected]>:

Hi Ted,
Thanks for the response. What you suggested will still need the use of
a UDF (top) that will be case specific. I was thinking if there's a way
we can generalize it so that people can do top-K on nested results.

Better yet if PIG itself supported it by having LIMIT inside FOREACH.
To give a better idea of what I am talking about, here's a sample
script...

data = LOAD 'myfile' AS (url, query);
grouped = GROUP data BY (url, query);
groupCount = FOREACH grouped GENERATE FLATTEN(group), COUNT(data) AS clicks;
grouped_by_url = GROUP groupCount BY url;
results = FOREACH grouped_by_url {
    ordered = ORDER groupCount BY clicks DESC;
    topK = LIMIT ordered 10; -- This is not supported but I wish it were :-)
    GENERATE FLATTEN(topK);
};
STORE results INTO 'mydir' USING PigStorage();

Do you think it makes sense for PIG to support it? If not, do we
resort to a generic top() UDF?

Thanks
-Ankur

-----Original Message-----
From: Ted Dunning [mailto:[email protected]]
Sent: Saturday, January 10, 2009 12:12 AM
To: [email protected]
Subject: Re: Top-K for nested fields

I think you could turn that inside out and do the counting first by
grouping on both fields and then do the top-n by grouping on field1.
I would cautiously expect that to be a bit faster.
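
To make the two-stage idea concrete, here is a minimal sketch in plain
Python rather than Pig, using the sample tuples from the original
message in this thread. The function name top_k_per_key and the
tie-breaking rule (alphabetical on field2) are assumptions made for
illustration, not anything Pig defines:

```python
from collections import Counter, defaultdict

def top_k_per_key(pairs, k):
    """Count (field1, field2) pairs, then keep the k most frequent
    field2 values for each field1."""
    # Stage 1: group on both fields and count each pair.
    pair_counts = Counter(pairs)

    # Stage 2: regroup the already-counted rows by field1 alone.
    by_field1 = defaultdict(list)
    for (field1, field2), n in pair_counts.items():
        by_field1[field1].append((field2, n))

    # Within each group, order by count (descending) and keep k rows.
    result = {}
    for field1, rows in by_field1.items():
        rows.sort(key=lambda row: (-row[1], row[0]))
        result[field1] = rows[:k]
    return result

sample = [
    ("abc.com", "A"), ("abc.com", "A"), ("abc.com", "C"), ("abc.com", "B"),
    ("xyz.com", "D"), ("xyz.com", "D"), ("xyz.com", "E"),
]
top2 = top_k_per_key(sample, 2)
```

For abc.com this keeps (A, 2) and (B, 1). The second stage only ever
sees one row per distinct (field1, field2) pair, which is why it can be
cheaper than handing every raw tuple to a per-group UDF.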

On Fri, Jan 9, 2009 at 4:11 AM, Goel, Ankur <[email protected]>
wrote:

Let me try and rephrase my question.
I have a set of tuples of the form (field1, field2). I need to group by
'field1' and then sub-group by 'field2' and output the top-K instances
of field2 for each field1. What's the right way of doing that in Pig?

What I did was group my tuples by 'field1' and pass the DataBag to my
UDF, top(), which just counts the occurrences of each tuple and outputs
the top-K.
This worked but it didn't look like the most efficient solution.
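
For reference, the behaviour described for top() can be sketched in a
few lines of Python. This is only an illustration of the counting
logic, not the actual UDF (which is not shown in this thread); the
signature top(k, bag) mirrors the call in the original message and is
otherwise an assumption:

```python
from collections import Counter

def top(k, bag):
    """Count occurrences of each item in `bag` and return the k most
    frequent items as (item, count) pairs, most frequent first."""
    return Counter(bag).most_common(k)

# The queries seen for abc.com in the original example.
abc_queries = ["A", "A", "C", "B"]
top_one = top(1, abc_queries)
```

Note that the whole bag for a key has to be handed to the function
before any counting starts, which is one plausible reason this approach
feels less efficient on large groups.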

Can anyone suggest something different?

Thanks
-Ankur

-----Original Message-----
From: Goel, Ankur [mailto:[email protected]]
Sent: Thursday, January 08, 2009 3:03 PM
To: [email protected]; [email protected]
Subject: Top-K for nested fields

Hi Folks,

I have a case where I need to do top-K on nested fields in my tuple.
For example, consider the following tuples (format is [url, query]):

(abc.com, A)
(abc.com, A)
(abc.com, C)
(abc.com, B)
(xyz.com, D)
(xyz.com, D)
(xyz.com, E)

I need to be able to group by URL and output the top-K queries along
with their count for each URL. So the output would be

abc.com A 2
abc.com B 1
abc.com C 1

In my understanding we would do something like

url = GROUP tuples BY url;
result = FOREACH url GENERATE group, top(10, tuples.query);

Is there a UDF to do this? If not then I can write one and possibly
contribute.

Is there any other way of doing it?

Thanks
-Ankur



--
Ted Dunning, CTO
DeepDyve
4600 Bohannon Drive, Suite 220
Menlo Park, CA 94025
www.deepdyve.com
650-324-0110, ext. 738
858-414-0013 (m)
