Jon, I ran the right script, I just wrote out the wrong one in the email :-).
I also compared results of both computations to ensure correctness.

Arnab posted his slides: http://pdf.cx/44wrk
My approach is the "naive approach" described in slides 11-17.
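
For anyone following along without a Pig install, here is a minimal Python sketch of the naive approach as it is discussed in this thread: expand every row into all of its cube combinations, then group and count. This is only an illustration, not the actual UDF. The names cube_dimensions and naive_cube are hypothetical, None stands in for Pig's null, and it assumes the input values themselves are never null.

```python
from collections import Counter
from itertools import product


def cube_dimensions(row):
    """Return every cube combination of a row (hypothetical stand-in for the UDF).

    Each field is either kept or replaced by None, giving 2**n tuples
    for a row with n fields.
    """
    return [tuple(combo) for combo in product(*[(value, None) for value in row])]


def naive_cube(rel):
    """Count rows per cube cell by expanding each row, then grouping."""
    counts = Counter()
    for row in rel:
        # Equivalent in spirit to flatten(CubeDimensions(...)) followed by
        # a group-by with COUNT_STAR.
        counts.update(cube_dimensions(row))
    return counts


# The three-row example relation used later in this thread.
rel = [(1, 1), (1, 2), (2, 2)]
cube = naive_cube(rel)
for cell, count in cube.items():
    print(cell, count)
```

Every input row becomes 2^n intermediate rows, which is the blow-up in intermediate data mentioned below. Counting is algebraic, so partial counts can be merged, which is what lets the combiners help.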

D

On Thu, Jul 14, 2011 at 11:54 AM, Jonathan Coveney <[email protected]> wrote:
> Dmitry, a quick point on your approach...
>
> I assume you meant to run the following, replacing rel with cubed? If you
> ran what you pasted, you never actually reference the cubed relation that
> you output, which may have influenced run time.
>
> cubed = foreach rel generate flatten(CubeDimensions(a, b));
> cube = foreach (group cubed by $0) generate flatten(group) as (a,
> b), COUNT_STAR(cubed);
>
> Gianmarco:
>
> Let's say that rel looks like this:
>
> 1,1
> 1,2
> 2,2
>
> your results from group on (a,b)
> (1,1):1
> (1,2):1
> (2,2):1
>
> grouping on (a,null):
> (1,null): 2
> (2,null): 1
>
> grouping on (null,b):
> (null,1): 1
> (null,2): 2
>
> group on (null,null):
> (null,null): 3
>
> here is what cubed would look like
>
> {(1,1),(1,null),(null,1),(null,null)}
> {(1,2),(1,null),(null,2),(null,null)}
> {(2,2),(2,null),(null,2),(null,null)}
>
> When you flatten it out, you'll have
>
> (1,1)
> (1,null)
> (null,1)
> (null,null)
> (1,2)
> (1,null)
> (null,2)
> (null,null)
> (2,2)
> (2,null)
> (null,2)
> (null,null)
>
> Now we group on the value, for which the possibilities/counts are...
>
> (1,1):1
> (1,2):1
> (2,2):1
> (1,null): 2
> (2,null): 1
> (null,1): 1
> (null,2): 2
> (null,null): 3
>
> The same. What you're doing is blowing up the intermediate info.
>
> Now a point on methodology:
>
> To implement the CUBE command, might it be faster to do this in the map job
> itself? I.e., when you hit a row, you emit all of the combinations. This is
> essentially the same thing, just at a lower level. Of course, for big cubes
> the issue is going to be the exponential increase in space.
>
>
> 2011/7/14 Gianmarco <[email protected]>
>
>> If you want to add a new operator, the right place to add the logic
>> should be LogicalPlanBuilder.
>>
>> Just a question, are you sure this code is correct? I can't understand how
>> it works.
>>
>> cubed = foreach rel generate flatten(CubeDimensions(a, b));
>> cube = foreach (group rel by $0) generate flatten(group) as (a, b),
>> COUNT_STAR(rel);
>>
>>
>> Cheers,
>> --
>> Gianmarco De Francisci Morales
>>
>>
>> On Thu, Jul 14, 2011 at 03:05, Dmitriy Ryaboy <[email protected]> wrote:
>>
>> > Arnab gave a really interesting presentation at the post-hadoop-summit
>> > Pig meeting about how Cubing could work in Map-Reduce, and suggested a
>> > straightforward path to integrating into Pig. Arnab, do you have the
>> > presentation posted somewhere?
>> >
>> > In any case, I started mucking around a little with this, trying to
>> > hack in the naive solution.
>> >
>> > So far, one interesting result, followed by a question:
>> >
>> > I manually cubed by writing a bunch of group-bys, like so (using Pig 0.8):
>> >
>> > ab = foreach (group rel by (a, b)) generate flatten(group) as (a, b),
>> > COUNT_STAR(rel) as cnt;
>> > a_only = foreach (group rel by (a, null)) generate flatten(group) as
>> > (a, b), COUNT_STAR(rel) as cnt;
>> > b_only = foreach (group rel by (null, b)) generate flatten(group) as
>> > (a, b), COUNT_STAR(rel) as cnt;
>> > neither = foreach (group rel by (null, null)) generate flatten(group) as
>> > (a, b), COUNT_STAR(rel) as cnt;
>> > cube = union ab, a_only, b_only, neither;
>> > store cube ....
>> >
>> > Except for extra fun, I did this with 3 dimensions and therefore 8
>> > groupings. This generated 4 MR jobs, the first of which moved all the
>> > data across the wire despite the fact that COUNT_STAR is algebraic. On
>> > my test dataset, the work took 18 minutes.
>> >
>> > I then wrote a UDF that given a tuple, created all the cube dimensions
>> > of the tuple -- so CubeDimensions(a, b) returns { (a, b), (a, null),
>> > (null, b), (null, null) }, and this works on any number of dimensions.
>> > The naive cube then simply becomes this:
>> >
>> > cubed = foreach rel generate flatten(CubeDimensions(a, b));
>> > cube = foreach (group rel by $0) generate flatten(group) as (a, b),
>> > COUNT_STAR(rel);
>> >
>> > On the same dataset, this generated only 1 MR job, and ran in 3
>> > minutes because we were able to take advantage of the combiners!
>> >
>> > Assuming algebraic aggregations, this is actually pretty good given
>> > how little work it involves.
>> >
>> > I looked at adding a new operator that would be (for now) syntactic
>> > sugar around this pattern -- basically, "CUBE rel by (a, b, c)" would
>> > insert the operators equivalent to the code above.
>> >
>> > I can muddle my way through the grammar. What's the appropriate place
>> > to put the translation logic? Logical to physical compiler? Optimizer?
>> > The LogicalPlanBuilder?
>> >
>> > D
>> >
>>
>
