I would encourage you to open a JIRA. If people disagree with
putting limit in the nested foreach they can make their arguments
against it there. In general, our desire is to make Pig Latin fully
nestable (so any keyword could be in a foreach). Adding this feature
should be very simple, as limit is easy to implement. So if
someone wanted to take this on it should not be much work. I don't
have time to implement and test it, but I'm happy to provide guidance
on the necessary changes to anyone interested.
Alan.
On Jan 12, 2009, at 11:52 PM, Goel, Ankur wrote:
Rad,
Pig types branch does have support for LIMIT but not for nested
structures inside FOREACH. So as a workaround I did implement a top()
UDF.
But I think it makes sense to have LIMIT support for nested structures
also.
We can open a JIRA for this if people agree.
Thanks
-Ankur
-----Original Message-----
From: rad gara [mailto:[email protected]]
Sent: Monday, January 12, 2009 5:55 PM
To: [email protected]
Subject: Re: Top-K for nested fields
Ankur, concerning your code below, a TakeFirst(bag, count) UDF can be
implemented. So the desired line would be
topK = TakeFirst(ordered, 10);
But I guess the performance of a nested FOREACH statement may not be
very good when processing large bags within FOREACH (right?). It seems
that Pig support for LIMIT is necessary for limiting large relations.
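
A rough sketch of what TakeFirst(bag, count) would compute, written in
plain Python rather than as a Pig UDF. The name take_first and the use
of a Python list to stand in for a bag are assumptions made for
illustration only; an actual UDF would be written against Pig's UDF
API, which is not shown here.

```python
from itertools import islice


def take_first(bag, count):
    """Return at most `count` items from the front of `bag`.

    Hypothetical stand-in for the TakeFirst(bag, count) UDF discussed
    above. `bag` may be any iterable; only the first `count` items are
    consumed, so the rest of a large bag is never touched.
    """
    return list(islice(bag, count))


# Usage: the bag is assumed to be ordered already (e.g. by clicks DESC).
ordered = [("A", 2), ("B", 1), ("C", 1)]
top_two = take_first(ordered, 2)
```

Because only the first `count` items are read, the cost of this step
depends on `count`, not on the size of the bag. The ordering step
before it is still the expensive part for large bags.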
2009/1/12 Goel, Ankur <[email protected]>:
Hi Ted,
Thanks for the response. What you suggested will still need the use of
a UDF (top) that will be case specific. I was thinking if there's a way
we can generalize it so that people can do top-K on nested results.
Better yet if PIG itself supported it by having LIMIT inside FOREACH.
To give a better idea of what I am talking about here's some sample
script...
data = LOAD 'myfile' as (url, query);
grouped = GROUP data BY (url, query);
groupCount = FOREACH grouped GENERATE FLATTEN(group), COUNT(*) as clicks;
grouped_by_url = GROUP groupCount BY url;
results = FOREACH grouped_by_url {
    ordered = ORDER groupCount BY clicks DESC;
    topK = LIMIT ordered 10; -- This is not supported but I wish it were :-)
    GENERATE FLATTEN(topK);
};
STORE results INTO 'mydir' USING PigStorage();
Do you think it makes sense for PIG to support it? If not then do we
resort to a generic top() UDF?
Thanks
-Ankur
-----Original Message-----
From: Ted Dunning [mailto:[email protected]]
Sent: Saturday, January 10, 2009 12:12 AM
To: [email protected]
Subject: Re: Top-K for nested fields
I think you could turn that inside out and do the counting first by
grouping on both fields and then do the top-n by grouping on field1.
I would cautiously expect that to be a bit faster.
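
The two-stage approach described above can be sketched in plain Python
as follows. This only illustrates the logic and is not Pig code: the
function name top_k_per_key, the tie-breaking rule (alphabetical on
field2) and the in-memory data structures are assumptions for the
example. The sample tuples are the ones from the original question
quoted further down in this thread.

```python
from collections import Counter, defaultdict


def top_k_per_key(pairs, k):
    """Count (field1, field2) pairs first, then keep the top k per field1."""
    # Stage 1: count each distinct (field1, field2) pair, which is what
    # grouping on both fields and counting would produce.
    counts = Counter(pairs)

    # Stage 2: regroup the already-counted pairs by field1 alone.
    by_field1 = defaultdict(list)
    for (field1, field2), n in counts.items():
        by_field1[field1].append((field2, n))

    # Within each group, order by count descending (ties broken by
    # field2, an assumption of this sketch) and keep the first k.
    return {
        field1: sorted(items, key=lambda item: (-item[1], item[0]))[:k]
        for field1, items in by_field1.items()
    }


SAMPLE = [
    ("abc.com", "A"), ("abc.com", "A"), ("abc.com", "C"), ("abc.com", "B"),
    ("xyz.com", "D"), ("xyz.com", "D"), ("xyz.com", "E"),
]
result = top_k_per_key(SAMPLE, 2)
```

After stage 1 each group holds one entry per distinct field2 value
rather than one per input tuple, so the per-group sort in stage 2
works on much smaller inputs when values repeat often.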
On Fri, Jan 9, 2009 at 4:11 AM, Goel, Ankur <[email protected]> wrote:
Let me try and rephrase my question.
I have a set of tuples of the form (field1, field2). I need to group by
'field1' and then sub-group by 'field2' and output top-k instances of
field2 for field1. What's the right way of doing that in pig?
What I did was grouped my tuples by 'field1' and passed the DataBag to
my UDF - top() which just counts the occurrence of each tuple and
outputs top-K.
This worked but it didn't look like the most efficient solution.
Can anyone suggest something different?
Thanks
-Ankur
-----Original Message-----
From: Goel, Ankur [mailto:[email protected]]
Sent: Thursday, January 08, 2009 3:03 PM
To: [email protected]; [email protected]
Subject: Top-K for nested fields
Hi Folks,
I have a case wherein I need to do top-K on nested fields in my tuple.
For example, consider the following tuples (format is [url, query])
(abc.com, A)
(abc.com, A)
(abc.com, C)
(abc.com, B)
(xyz.com, D)
(xyz.com, D)
(xyz.com, E)
I need to be able to group by URL and output top-K queries along with
their count for each URL. So output would be
abc.com A 2
abc.com B 1
abc.com C 1
In my understanding we would do something like
url = GROUP tuples BY url;
result = FOREACH url GENERATE group, top(10, query);
Is there a UDF to do this? If not then I can write one and possibly
contribute.
Is there any other way of doing it?
Thanks
-Ankur
--
Ted Dunning, CTO
DeepDyve
4600 Bohannon Drive, Suite 220
Menlo Park, CA 94025
www.deepdyve.com
650-324-0110, ext. 738
858-414-0013 (m)