I've done this with the following:
raw = load 'thing' as (user, page);

pageviews = foreach (group raw by (user, page)) generate
    flatten(group), COUNT($1) as views;

pagecounts = foreach (group pageviews by page) generate
    flatten(group), COUNT($1) as uniques, SUM(pageviews.views) as pageviews;

It's the only way I've been able to get it to scale.
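
(In terms of your example, page plays the role of x and user the role of y;
the second group produces both the distinct count and the total per key in
one pass, so there's no nested distinct bag to hold in memory and no join.)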


On Wed, Nov 23, 2011 at 22:44, Jonathan Coveney <[email protected]> wrote:
> The problem is that as datasets grow, a nested distinct can often lead to
> a heap error. (I've been testing some scripts in Pig 0.9, and for whatever
> reason a bunch of scripts that are on the edge in Pig 0.8 are dying in Pig
> 0.9 with heap errors caused by distinct... but there are a lot of moving
> parts there.) Either way, I'm of the opinion that heap errors are bad!
>
> I was wondering: are there any known methods (or papers in academia) for
> getting around this efficiently, short of grouping the data two separate
> times?
>
> so for example, we have
>
> a = load 'thing' as (x,y);
> b = foreach (group a by x) {
>   dst = distinct a.y;
>   generate group, COUNT(dst), COUNT(a);
> }
>
> So this gives us the total number of y per x, and the distinct number of y
> per x (you can imagine that x is a website, y is a cookie, and the values
> are distinct cookies and page views).
>
> So in this case, eventually certain sites may get really popular, and there
> could be enough distinct cookies in a single group's bag to kill you.
>
> Now, the way that I'd normally refactor the code is...
>
> a = load 'thing' as (x,y);
> b = foreach (group a by x) generate group, COUNT(a);
>
> c = foreach (group a by (x,y)) generate flatten(group) as (x,y);
> d = foreach (group c by x) generate group, COUNT(c);
>
> and then you join b and d on the group key to merge the two. Nasty!
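>
> Concretely, the merge step ends up being something like this (a sketch,
> using positional references since b and d above don't alias their count
> fields):
>
> e = join b by $0, d by $0;
> f = foreach e generate $0 as x, $1 as total, $3 as uniques;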
>
> Are there any ways to optimize these sorts of queries on the Pig side to
> avoid memory issues, and to keep the syntax clean?
>
> I appreciate the thought
> Jon
>
