The problem is that as datasets grow, a nested distinct can often lead
to a heap error (I've been testing some scripts in Pig 0.9, and for whatever
reason a bunch of scripts that are on the edge in Pig 0.8 are dying in Pig 0.9
with heap errors caused by distinct... but there are a lot of moving parts
there). Either way, I'm of the opinion that heap errors are bad!

I was wondering if there are any known methods (or academic papers) for
efficiently working around this, short of grouping the data two separate times?

so for example, we have

a = load 'thing' as (x,y);
b = foreach (group a by x) {
  dst = distinct a.y;
  generate group, COUNT(dst), COUNT(a);
}

so this gives us the total number of y per x, and the number of distinct y
per x (you can imagine that x is a website, y is a cookie, and the values
are distinct cookies and page views).

So in this case, eventually certain sites may get really popular and there
could be enough distinct cookies to kill you.

Now, the way that I'd normally refactor the code is...

a = load 'thing' as (x,y);
b = foreach (group a by x) generate group, COUNT(a);

c = foreach (group a by (x,y)) generate flatten(group) as (x,y);
d = foreach (group c by x) generate group, COUNT(c);

and then you join them on the group key to merge the two. Nasty!
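For completeness, the merge step looks something like this (just a sketch;
I've re-aliased the count fields, which the snippet above leaves unnamed, so
the join and final projection read cleanly):

b = foreach (group a by x) generate group as x, COUNT(a) as total_y;

c = foreach (group a by (x,y)) generate flatten(group) as (x,y);
d = foreach (group c by x) generate group as x, COUNT(c) as distinct_y;

e = join b by x, d by x;
f = foreach e generate b::x as x, total_y, distinct_y;

So you end up with three extra relations and a join just to get what the
nested distinct expressed in one statement.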

Are there any ways to optimize these sorts of queries on the Pig side to
avoid memory issues, and to keep the syntax clean?

I appreciate any thoughts,
Jon
