+1 to what Gianmarco said about the place to do it. See sample_clause
in LogicalPlanGenerator.g.
I tried the expanded query (2 dimensions) with 0.8; it results in only 2
MR jobs. The 1st MR job does all the computation, and the 2nd MR job
just concatenates the outputs into one file. See
http://pastebin.com/aarBELC2. I got an exception in 0.9 for the same
query, and have created a jira (PIG-2164) to address that.
The CubeDimensions udf would be a nice way to get around a combiner
issue, but the combiner issue (if any) should actually get fixed.
In the example, you are putting all records into the same file. That
would lead to a problem, because it would not be possible to distinguish
records for (group by (a,b)) that have a null value of b from records
for (group by (a,null)). If all outputs go into the same file, each
record would need a marker column indicating which grouping it belongs to.
I think that in most cases people would read the results of the
different group-by combinations separately, so it makes sense to produce
separate output files (e.g., 8 files if there are 3 dimensions). That
is, a split on the marker column might have to be introduced.
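To make the ambiguity concrete, here is a hypothetical sketch (plain Python, not Pig) of the problem and the marker-column fix: a rolled-up null for b looks identical to a real null value of b unless each record carries which grouping produced it.

```python
# Hypothetical illustration: two different groupings can emit the
# same-looking key (a, None), so a marker column is needed to tell
# them apart when everything lands in one file.
rows_group_ab = [("x", None, 3)]        # group by (a,b): b really was null
rows_group_a_only = [("x", None, 10)]   # group by (a,null): b rolled up

# Without a marker, the two records collide on their group key:
assert rows_group_ab[0][:2] == rows_group_a_only[0][:2]

# With a marker column naming the grouping, a downstream split can
# route each record to its own per-grouping output:
marked = [("ab",) + r for r in rows_group_ab] + \
         [("a_only",) + r for r in rows_group_a_only]
by_grouping = {}
for m, a, b, cnt in marked:
    by_grouping.setdefault(m, []).append((a, b, cnt))
```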
Thanks,
Thejas
On 7/13/11 6:05 PM, Dmitriy Ryaboy wrote:
Arnab gave a really interesting presentation at the post-hadoop-summit
Pig meeting about how cubing could work in Map-Reduce, and suggested a
straightforward path to integrating it into Pig. Arnab, do you have the
presentation posted somewhere?
In any case, I started mucking around a little with this, trying to
hack in the naive solution.
So far, one interesting result, followed by a question:
I manually cubed by writing a bunch of group-bys, like so (using Pig 0.8):
ab = foreach (group rel by (a, b)) generate flatten(group) as (a, b),
COUNT_STAR(rel) as cnt;
a_only = foreach (group rel by (a, null)) generate flatten(group) as
(a, b), COUNT_STAR(rel) as cnt;
b_only = foreach (group rel by (null, b)) generate flatten(group) as
(a, b), COUNT_STAR(rel) as cnt;
neither = foreach (group rel by (null, null)) generate flatten(group) as
(a, b), COUNT_STAR(rel) as cnt;
cube = union ab, a_only, b_only, neither;
store cube ....
Except, for extra fun, I did this with 3 dimensions and therefore 8
groupings. This generated 4 MR jobs, the first of which moved all the
data across the wire despite the fact that COUNT_STAR is algebraic. On
my test dataset, the work took 18 minutes.
I then wrote a UDF that, given a tuple, creates all the cube dimensions
of the tuple, so CubeDimensions(a, b) returns { (a, b), (a, null),
(null, b), (null, null) }, and this works for any number of dimensions.
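For reference, the expansion such a UDF performs can be sketched in a few lines of Python (a mock-up of the logic, not the actual UDF code): for n dimensions it emits all 2^n combinations of keeping each value or replacing it with null.

```python
from itertools import product

def cube_dimensions(*dims):
    """Mock-up of a CubeDimensions-style expansion: return all 2^n
    null-masked variants of the input tuple, e.g.
    (a, b) -> [(a, b), (a, None), (None, b), (None, None)]."""
    return [tuple(d if keep else None for d, keep in zip(dims, mask))
            for mask in product((True, False), repeat=len(dims))]
```

Each input record then contributes one row per variant, and a single downstream group-by over the expanded rows computes every cube cell at once.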
The naive cube then simply becomes this:
cubed = foreach rel generate flatten(CubeDimensions(a, b)) as (a, b);
cube = foreach (group cubed by (a, b)) generate flatten(group) as (a, b),
COUNT_STAR(cubed);
On the same dataset, this generated only 1 MR job, and ran in 3
minutes because we were able to take advantage of the combiners!
Assuming algebraic aggregations, this is actually pretty good given
how little work it involves.
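Why the rewrite gets combiners: COUNT_STAR is algebraic, so it can be split into initial, intermediate, and final phases, and partial counts computed map-side are merged instead of shipping every expanded record over the wire. A toy Python sketch of that three-phase decomposition (mirroring the shape of the split, not Pig's actual Algebraic interface):

```python
# Toy decomposition of an algebraic COUNT into the three phases
# that allow a combiner to run between map and reduce.
def count_initial(records):
    return len(records)        # per-mapper partial count

def count_intermed(partials):
    return sum(partials)       # combiner merges partial counts

def count_final(partials):
    return sum(partials)       # reducer produces the final count

# Three "mappers" each count locally; only three small numbers
# cross the wire instead of every expanded record.
partials = [count_initial(chunk) for chunk in ([1, 2], [3], [4, 5, 6])]
total = count_final([count_intermed([p]) for p in partials])
```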
I looked at adding a new operator that would be (for now) syntactic
sugar around this pattern -- basically, "CUBE rel by (a, b, c)" would
insert the operators equivalent to the code above.
I can muddle my way through the grammar. What's the appropriate place
to put the translation logic? Logical to physical compiler? Optimizer?
The LogicalPlanBuilder?
D