I've done this with the following:

raw = load 'thing' as (user, page);

-- first pass: one row per (user, page) pair, carrying that user's
-- view count for the page
pageviews = foreach (group raw by (user, page))
    generate flatten(group), COUNT(raw) as pageviews;

-- second pass: per page, count the surviving rows (one per distinct
-- user) and sum the per-user counts to recover total views
pagecounts = foreach (group pageviews by page)
    generate flatten(group), COUNT(pageviews) as uniques,
        SUM(pageviews.pageviews) as pageviews;

It's the only way I've been able to get it to scale.

On Wed, Nov 23, 2011 at 22:44, Jonathan Coveney <[email protected]> wrote:

> The problem being that as datasets grow, a nested distinct can often lead
> to a heap error (I've been testing some scripts in Pig9, and for whatever
> reason a bunch of scripts that are on the edge in pig8 are dying in pig9
> with heap errors caused by distinct... but there are a lot of moving parts
> there). Either way, I'm of the opinion that heap errors are bad!
>
> I was wondering if there are any known methods (or papers in academia) of
> efficient ways around this, short of grouping the data two separate times?
>
> So for example, we have:
>
> a = load 'thing' as (x,y);
> b = foreach (group a by x) {
>     dst = distinct a.y;
>     generate group, COUNT(dst), COUNT(a);
> }
>
> So this gives us the total number of y per x, and the distinct number of
> y per x (you can imagine that x is a website, y is a cookie, and the
> values are distinct cookies and page views).
>
> So in this case, eventually certain sites may get really popular and
> there could be enough distinct cookies to kill you.
>
> Now, the way that I'd normally refactor the code is...
>
> a = load 'thing' as (x,y);
> b = foreach (group a by x) generate group, COUNT(a);
>
> c = foreach (group a by (x,y)) generate flatten(group) as (x,y);
> d = foreach (group c by x) generate group, COUNT(c);
>
> and then you join them on the group key to merge the two. Nasty!
>
> Are there any ways to optimize these sorts of queries on the Pig side to
> avoid memory issues, and to keep the syntax clean?
>
> I appreciate the thought
> Jon
>
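For reference, a minimal sketch of the "join them on the group key" step from the refactored version quoted above. The field names total and uniques are mine, added for readability; Jon's original leaves the counts unnamed:

a = load 'thing' as (x,y);
b = foreach (group a by x) generate group as x, COUNT(a) as total;
c = foreach (group a by (x,y)) generate flatten(group) as (x,y);
d = foreach (group c by x) generate group as x, COUNT(c) as uniques;

-- merge the two aggregates on the shared key x
merged = join b by x, d by x;
result = foreach merged
    generate b::x as x, d::uniques as uniques, b::total as total;

It costs an extra MR job plus the join (the "Nasty!" part), but since COUNT is algebraic the combiner kicks in, so reducers only ever hold counters rather than a bag of distinct values.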
