Arnab gave a really interesting presentation at the post-Hadoop-Summit Pig meeting about how cubing could work in MapReduce, and suggested a straightforward path to integrating it into Pig. Arnab, do you have the presentation posted somewhere?
In any case, I started mucking around a little with this, trying to hack in the naive solution. So far, one interesting result, followed by a question.

I manually cubed by writing a bunch of group-bys, like so (using Pig 0.8):

    ab = foreach (group rel by (a, b)) generate flatten(group) as (a, b), COUNT_STAR(rel) as cnt;
    a_only = foreach (group rel by (a, null)) generate flatten(group) as (a, b), COUNT_STAR(rel) as cnt;
    b_only = foreach (group rel by (null, b)) generate flatten(group) as (a, b), COUNT_STAR(rel) as cnt;
    neither = foreach (group rel by (null, null)) generate flatten(group) as (a, b), COUNT_STAR(rel) as cnt;
    cube = union ab, a_only, b_only, neither;
    store cube ....

Except, for extra fun, I did this with 3 dimensions and therefore 8 groupings. This generated 4 MR jobs, the first of which moved all the data across the wire despite the fact that COUNT_STAR is algebraic. On my test dataset, the work took 18 minutes.

I then wrote a UDF that, given a tuple, creates all the cube dimensions of the tuple. So CubeDimensions(a, b) returns { (a, b), (a, null), (null, b), (null, null) }, and this works on any number of dimensions. The naive cube then simply becomes this:

    cubed = foreach rel generate flatten(CubeDimensions(a, b)) as (a, b);
    cube = foreach (group cubed by (a, b)) generate flatten(group) as (a, b), COUNT_STAR(cubed) as cnt;

On the same dataset, this generated only 1 MR job and ran in 3 minutes, because we were able to take advantage of the combiners! Assuming algebraic aggregations, this is actually pretty good given how little work it involves.

I looked at adding a new operator that would be (for now) syntactic sugar around this pattern. Basically, "CUBE rel by (a, b, c)" would insert the operators equivalent to the code above. I can muddle my way through the grammar. What's the appropriate place to put the translation logic? Logical to physical compiler? Optimizer? The LogicalPlanBuilder?

D
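
For anyone who wants to see the expansion spelled out, here is a minimal sketch of what CubeDimensions does to each tuple, followed by the naive count. It is plain Python rather than the actual Pig UDF; the names cube_dimensions and naive_cube_counts are made up for illustration, and None stands in for Pig's null.

```python
from collections import Counter
from itertools import product


def cube_dimensions(*dims):
    """Return every combination of the inputs in which each position
    is either kept or replaced by None (2**n combinations for n inputs)."""
    # For each dimension, the choice is between the actual value and None.
    choices = [(value, None) for value in dims]
    return [tuple(combo) for combo in product(*choices)]


def naive_cube_counts(rows):
    """Count rows per cube cell: expand each row, then group and count.
    This mirrors the flatten + group + COUNT_STAR pattern above."""
    counts = Counter()
    for row in rows:
        counts.update(cube_dimensions(*row))
    return counts
```

Because each input row is expanded before grouping, a single group-by covers every cell of the cube, which is what lets an algebraic aggregate run in the combiner. Note that this sketch does not treat input values that are already null specially; a real UDF would have to decide how to handle them so they are not confused with the rolled-up cells.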