If you are willing to give up a (very small) amount of precision for this specific kind of query, you can use approximate counters such as Flajolet-Martin or HyperLogLog. We could implement them in a special COUNT_APPROX() builtin function. You can also use Bloom filters for an approximate distinct implementation.
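To make the idea concrete, here is a minimal HyperLogLog sketch in Python. It is not Pig code and not an existing Pig builtin (COUNT_APPROX() is only a proposal); the class name, the use of SHA-1 as the hash, and the default of 2^12 registers are assumptions made for illustration.

```python
import hashlib
import math


class HyperLogLog:
    """Illustrative HyperLogLog counter (hypothetical helper, not a Pig API)."""

    def __init__(self, p=12):
        # p index bits give m = 2**p registers; the alpha constant used
        # below is the standard approximation, valid for m >= 128.
        if not 7 <= p <= 16:
            raise ValueError("p must be between 7 and 16")
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        # Take a 64-bit hash of the item (first 8 bytes of SHA-1).
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        width = 64 - self.p
        idx = h >> width                   # top p bits select a register
        rest = h & ((1 << width) - 1)      # remaining bits
        # Rank = 1-based position of the leftmost 1-bit in `rest`.
        rank = width - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other):
        # Sketches built on separate partitions combine by taking the
        # element-wise maximum of their registers.
        if other.p != self.p:
            raise ValueError("cannot merge sketches of different size")
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def count(self):
        z = sum(2.0 ** -r for r in self.registers)
        estimate = self.alpha * self.m * self.m / z
        zeros = self.registers.count(0)
        # Small-range correction: fall back to linear counting.
        if estimate <= 2.5 * self.m and zeros:
            estimate = self.m * math.log(self.m / zeros)
        return int(round(estimate))


# Usage: count distinct cookies for one site, with every cookie seen twice.
sketch = HyperLogLog()
for i in range(50000):
    sketch.add("cookie-%d" % i)
    sketch.add("cookie-%d" % i)
approx_distinct = sketch.count()
```

The memory used is fixed by the number of registers, no matter how many distinct values arrive, and the standard error is roughly 1.04/sqrt(m), or about 1.6% for m = 4096. Because two sketches merge by a register-wise maximum, partial results computed on different parts of the data can be combined later.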
For the general case, I think there is no solution. A nested operation is inherently handled locally, so memory restrictions apply.

Cheers,
--
Gianmarco

On Thu, Nov 24, 2011 at 04:44, Jonathan Coveney <[email protected]> wrote:

> The problem being that as datasets grow, a nested distinct can often lead
> to a heap error (I've been testing some scripts in Pig9, and for whatever
> reason a bunch of scripts that are on the edge in pig8 are dying in pig9
> with heap errors caused by distinct...but there are a lot of moving parts
> there). Either way, I'm of the opinion that heap errors are bad!
>
> I was wondering if there are any known methods (or papers in academia) of
> efficient ways around this, short of grouping the data two separate times?
>
> so for example, we have
>
> a = load 'thing' as (x,y);
> b = foreach (group a by x) {
>   dst = distinct a.y;
>   generate group, COUNT(dst), COUNT(a);
> }
>
> so this gives us the total number of y per x, and the distinct number of y
> per x (you can imagine that x is a website, y is a cookie, and the values
> are distinct cookies and page views).
>
> So in this case, eventually certain sites may get really popular and there
> could be enough distinct cookies to kill you.
>
> Now, the way that I'd normally refactor the code is...
>
> a = load 'thing' as (x,y);
> b = foreach (group a by x) generate group, COUNT(a);
>
> c = foreach (group a by (x,y)) generate flatten(group) as (x,y);
> d = foreach (group c by x) generate group, COUNT(c);
>
> and then you join them on the group key to merge the two. Nasty!
>
> Are there any ways to optimize these sorts of queries on the Pig side to
> avoid memory issues, and to keep the syntax clean?
>
> I appreciate the thought
> Jon
