In the DW world, using a single table with null as an "all" marker is the standard thing to do. In my UDF I actually allow an optional string to be passed to the constructor to denote "all", for cases where null is a valid dimension value... I'll post the UDF shortly; it's a prerequisite to LOCube.
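The expansion that UDF performs can be sketched in a few lines of Python (a hypothetical illustration, not the actual Java UDF; the function name `cube_dimensions` and the `all_marker` parameter are stand-ins for the constructor string described above):

```python
from itertools import product

def cube_dimensions(dims, all_marker=None):
    """Return every rollup of the input dimensions: each subset of
    positions is replaced by `all_marker`, giving 2^n tuples for n
    dimensions. `all_marker` plays the role of the optional constructor
    string, for data where null is itself a valid dimension value."""
    out = []
    for rolled in product((False, True), repeat=len(dims)):
        out.append(tuple(all_marker if r else d
                         for d, r in zip(dims, rolled)))
    return out
```

For two dimensions this yields the four groupings (a, b), (a, all), (all, b), (all, all), and the same code covers any number of dimensions.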
I suspect the case of splitting out the agg levels is actually more rare, and can easily be accomplished with a SPLIT operator. The other nice thing about the UDF is how much code it saves, especially for larger numbers of dimensions. Perhaps my sample script generated 4 jobs because I had 3 dimensions?

On Jul 14, 2011, at 4:10 PM, Thejas Nair <the...@hortonworks.com> wrote:

> +1 to what Gianmarco said about the place to do it. See sample_clause in
> LogicalPlanGenerator.g.
>
> I tried the expanded query (2 dimensions) with 0.8; it results in only 2 MR
> jobs, with all the computation done in the 1st MR job. The 2nd MR job just
> concats the outputs into one file. See http://pastebin.com/aarBELC2. I got
> an exception in 0.9 for the same query; I have created a jira (PIG-2164) to
> address that.
>
> The CubeDimensions udf would be a nice way to get around a combiner issue,
> but the combiner issue (if any) should actually get fixed.
>
> In the example, you are putting all records into the same file. That would
> lead to a problem, because it will not be possible to distinguish between
> the records for (group by (a,b)) that have a null value of b and the
> records for (group by (a,null)). If everything goes into the same file, it
> would need a marker column to indicate which grouping each record belongs
> to.
> I think in most cases people would read the results of the different
> group-by combinations separately, so it makes sense to have different
> output files (e.g., 8 files if there are 3 dimensions). I.e., a split on
> the marker column might have to be introduced.
>
> Thanks,
> Thejas
>
> On 7/13/11 6:05 PM, Dmitriy Ryaboy wrote:
>> Arnab gave a really interesting presentation at the post-hadoop-summit
>> Pig meeting about how cubing could work in Map-Reduce, and suggested a
>> straightforward path to integrating it into Pig. Arnab, do you have the
>> presentation posted somewhere?
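Thejas's marker-column point can be made concrete with a small sketch (hypothetical Python, not Pig; the bitmask convention is an assumption, one bit per rolled-up dimension): a record whose b really is null and a record rolled up to (a, all) produce the same tuple, but never the same marker, and a downstream SPLIT on the marker recovers one output per aggregation level.

```python
from itertools import product
from collections import defaultdict

def cube_with_level(dims):
    """Expand a record to all 2^n rollups, tagging each with a bitmask of
    which positions were rolled up. This mask is the 'marker column':
    (a, null-as-data) carries mask 0, while (a, ALL) carries mask 2, so
    the two never collide even in a single output file."""
    n = len(dims)
    for mask in range(2 ** n):
        yield mask, tuple(None if mask & (1 << i) else d
                          for i, d in enumerate(dims))

def naive_cube_count(records):
    """COUNT per (marker, cube key); splitting on the marker afterwards
    would give one file per grouping combination."""
    counts = defaultdict(int)
    for rec in records:
        for key in cube_with_level(rec):
            counts[key] += 1
    return counts
```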
>>
>> In any case, I started mucking around a little with this, trying to
>> hack in the naive solution.
>>
>> So far, one interesting result, followed by a question:
>>
>> I manually cubed by writing a bunch of group-bys, like so (using Pig 0.8):
>>
>> ab = foreach (group rel by (a, b)) generate flatten(group) as (a, b),
>> COUNT_STAR(rel) as cnt;
>> a_only = foreach (group rel by (a, null)) generate flatten(group) as
>> (a, b), COUNT_STAR(rel) as cnt;
>> b_only = foreach (group rel by (null, b)) generate flatten(group) as
>> (a, b), COUNT_STAR(rel) as cnt;
>> total = foreach (group rel by (null, null)) generate flatten(group) as
>> (a, b), COUNT_STAR(rel) as cnt;
>> cube = union ab, a_only, b_only, total;
>> store cube ....
>>
>> For extra fun, I actually did this with 3 dimensions and therefore 8
>> groupings. This generated 4 MR jobs, the first of which moved all the
>> data across the wire despite the fact that COUNT_STAR is algebraic. On
>> my test dataset, the work took 18 minutes.
>>
>> I then wrote a UDF that, given a tuple, creates all the cube dimensions
>> of that tuple -- so CubeDimensions(a, b) returns { (a, b), (a, null),
>> (null, b), (null, null) } -- and this works on any number of dimensions.
>> The naive cube then simply becomes this:
>>
>> cubed = foreach rel generate flatten(CubeDimensions(a, b)) as (a, b);
>> cube = foreach (group cubed by (a, b)) generate flatten(group) as (a, b),
>> COUNT_STAR(cubed) as cnt;
>>
>> On the same dataset, this generated only 1 MR job, and ran in 3
>> minutes because we were able to take advantage of the combiners!
>>
>> Assuming algebraic aggregations, this is actually pretty good given
>> how little work it involves.
>>
>> I looked at adding a new operator that would be (for now) syntactic
>> sugar around this pattern -- basically, "CUBE rel BY (a, b, c)" would
>> insert the operators equivalent to the code above.
>>
>> I can muddle my way through the grammar. What's the appropriate place
>> to put the translation logic?
>> Logical to physical compiler? Optimizer?
>> The LogicalPlanBuilder?
>>
>> D
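Why the single-job plan wins comes down to COUNT_STAR being algebraic: once every record is expanded to its cube keys, each map task can pre-aggregate locally and ship only one partial count per distinct key. A rough Python sketch of that dataflow (function names `expand`, `map_side_combine`, and `reduce_merge` are illustrative, not Pig internals):

```python
from itertools import product
from collections import Counter

def expand(rec):
    """CubeDimensions-style expansion: one record -> its 2^n cube keys."""
    return [tuple(None if r else v for v, r in zip(rec, rolled))
            for rolled in product((False, True), repeat=len(rec))]

def map_side_combine(split):
    """Stand-in for a map task plus combiner: counts are pre-aggregated
    locally, so only one partial count per distinct cube key would cross
    the wire -- the reason the 1-job plan beat the 4-job plan."""
    c = Counter()
    for rec in split:
        c.update(expand(rec))
    return c

def reduce_merge(partials):
    """Stand-in for the reduce side: algebraic merge of partial counts."""
    total = Counter()
    for p in partials:
        total.update(p)
    return total
```

Merging per-split partial counts gives exactly the same totals as counting the whole dataset in one pass, which is what makes the combiner optimization safe here.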