Yeah, you are right.

Thanks
-- Prasanth

On Jun 21, 2012, at 4:41 PM, Jonathan Coveney wrote:

> Just to make sure I understand this correctly, is
>
> out = CUBE rel BY CUBE(a,b,c), ROLLUP(c,d), CUBE(e,f);
>
> equivalent to:
>
> out1 = CUBE rel BY (a,b,c);
> out2 = ROLLUP rel BY (c,d);
> out3 = CUBE rel BY (e,f);
>
> out = CROSS out1, out2, out3;
>
> ?
>
> 2012/6/21 Prasanth J <[email protected]>
>
>> Hello all
>>
>> I initially implemented ROLLUP as a separate operation with the following
>> syntax
>>
>> a = ROLLUP inp BY (x,y);
>>
>> which does the same thing as CUBE (inserting foreach + group-by in the
>> logical plan) except that it uses the RollupDimensions UDF. But the issue
>> with this approach is that we cannot mix CUBE and ROLLUP operations
>> together in the same syntax, which is a typical case. SQL/Oracle supports
>> using CUBE and ROLLUP together like
>>
>> GROUP BY CUBE(a,b,c), ROLLUP(c,d), CUBE(e,f);
>>
>> so I modified the Pig grammar to support similar usage. So now we can
>> use a syntax similar to SQL
>>
>> out = CUBE rel BY CUBE(a,b,c), ROLLUP(c,d), CUBE(e,f);
>>
>> In this approach, the logical plan should introduce a cartesian product
>> between the bags generated by CUBE(a,b,c), ROLLUP(c,d) and CUBE(e,f) for
>> generating the final output. But I read in the documentation
>> (http://pig.apache.org/docs/r0.10.0/basic.html#cross) that the CROSS
>> operator is expensive, and it advises using it sparingly.
>>
>> Is there any other way to achieve the cartesian product in a less
>> expensive way? Also, does anyone have thoughts about this new syntax?
>>
>> Thanks
>> -- Prasanth
>>
>> On May 30, 2012, at 8:10 PM, Jonathan Coveney wrote:
>>
>>> As far as the underlying implementation, if they all use the same
>>> optimizations that you use in cube, then it can be LOCube. If they have
>>> their own optimizations etc (or could), it may be worth them having their
>>> own Logical operators (which might just be LOCube with flags for the time
>>> being) that allows us more flexibility.
>>> But I suppose that's between you, eclipse, and your GSOC mentor.
>>>
>>> 2012/5/30 Prasanth J <[email protected]>
>>>
>>>> Thanks Alan and Jon for expressing your views.
>>>>
>>>> I agree with Jon's point: if the syntax contains CUBE then the user
>>>> expects it to perform a CUBE operation. So Jon's syntax seems more
>>>> meaningful and concise
>>>>
>>>> rel = CUBE rel BY (dims);
>>>> rel = ROLLUP rel BY (dims);
>>>> rel = GROUPING_SET rel BY (dims);
>>>>
>>>> The 2 reasons why I do not prefer using SQL syntax are
>>>> 1) I do not want to break into the existing Group operator implementation :)
>>>> 2) The syntax gets longer in case of partial hierarchical cubing/rollups
>>>> For ex:
>>>>
>>>> rel = GROUP rel BY dim0, ROLLUP(dim1, dim2, dim3), ROLLUP(dim4,dim5,dim6),
>>>> ROLLUP(dim7,dim8,dim9);
>>>>
>>>> whereas the same thing can be expressed like
>>>>
>>>> rel = ROLLUP rel BY dim0,
>>>> (dim1,dim2,dim3),(dim4,dim5,dim6),(dim7,dim8,dim9);
>>>>
>>>> Thanks Alan for pointing out the way for independently managing the
>>>> operators in the parser and the logical/physical plan. So for all these
>>>> operators (CUBE, ROLLUP, GROUPING_SET) I can just generate LOCube and use
>>>> flags to differentiate between these three operations.
>>>>
>>>> But, yes, we are proliferating operators in this case.
>>>>
>>>> Thanks
>>>> -- Prasanth
>>>>
>>>> On May 30, 2012, at 4:42 PM, Alan Gates wrote:
>>>>
>>>>>
>>>>> On May 30, 2012, at 10:43 AM, Jonathan Coveney wrote:
>>>>>
>>>>>> I was going to say the same thing Alan said w.r.t. operators: operators
>>>>>> in the grammar can correspond to whatever logical and physical operators
>>>>>> you want.
>>>>>>
>>>>>> As far as the principle of least astonishment compared to SQL... Pig is
>>>>>> already pretty astonishing.
>>>>>> I don't know why we would bend over backwards
>>>>>> to make the syntax so similar in this case when even getting to the point
>>>>>> of doing a CUBE means understanding an object model that is pretty
>>>>>> different from SQL.
>>>>>>
>>>>>> On that note,
>>>>>>
>>>>>> rel = CUBE rel BY GROUPING SETS(cols);
>>>>>>
>>>>>> seems really confusing. I'd much rather overload the group operator than
>>>>>> the cube operator. If I see "cube," I expect a cube. If you start doing
>>>>>> rollups etc, that's not a cube, it's a group. Or it's just a rollup. Pig
>>>>>> Latin is simple enough that I don't think having a rollup, group_set, etc
>>>>>> operator will be so confusing, because they're already going to be typing
>>>>>> that stuff in the context of
>>>>>>
>>>>>> group rel by rollup(cols); and so on. I don't see how it's worth adding
>>>>>> more, confusing syntax for the sake of creating parallels with a language
>>>>>> we now share very little with.
>>>>>
>>>>> Fair points.
>>>>>
>>>>>>
>>>>>> But I won't beat it any further... if people prefer a different syntax,
>>>>>> that's fine. Just excited to have the features in Pig!
>>>>> +1, I can live with any of the 3 syntax choices (near SQL, original, and
>>>>> Jon's).
>>>>>
>>>>> Alan.
>>>>>
>>>>>> Jon
>>>>>>
>>>>>> 2012/5/30 Alan Gates <[email protected]>
>>>>>>
>>>>>>> Some thoughts on this:
>>>>>>>
>>>>>>> 1) +1 to what Dmitriy said on HAVING
>>>>>>>
>>>>>>> 2) We need to be clear about separating operators in the grammar versus
>>>>>>> logical plan versus physical plan. The choices you make in the grammar are
>>>>>>> totally independent of the other two. That is, you could choose the syntax:
>>>>>>>
>>>>>>> rel = GROUP rel BY CUBE (a, b, c)
>>>>>>>
>>>>>>> and still have a separate POCube operator. When the parser sees GROUP BY
>>>>>>> CUBE it will generate an LOCube operator for the logical plan rather than
>>>>>>> an LOGroup operator.
>>>>>>> You can still have a separate POCube physical
>>>>>>> operator. Separate optimizations can be applied to LOGroup vs. LOCube and
>>>>>>> POGroup vs. POCube.
>>>>>>>
>>>>>>> 3) On syntax I can see arguments for keeping as close to SQL as possible
>>>>>>> and for the syntax proposed by Prasanth. The argument for sticking close
>>>>>>> to SQL is it conforms to the law of least astonishment. It wouldn't be
>>>>>>> exactly SQL, as it would end up looking like:
>>>>>>>
>>>>>>> rel = GROUP rel BY CUBE (cols)
>>>>>>> rel = GROUP rel BY ROLLUP (cols)
>>>>>>> rel = GROUP rel BY GROUPING SETS(cols);
>>>>>>>
>>>>>>> The argument I see for sticking with Prasanth's approach is that GROUP is
>>>>>>> really short for COGROUP in Pig Latin, and I don't think we're proposing
>>>>>>> doing COGROUP rel BY CUBE, nor can I see a case where you'd want to do such
>>>>>>> a thing. This makes CUBE really a separate operation. But if we go this
>>>>>>> route I agree with Prasanth we should do CUBE rel BY ROLLUP and CUBE rel BY
>>>>>>> GROUPING SETS. Let's not proliferate operators.
>>>>>>>
>>>>>>> Alan.
>>>>>>>
>>>>>>> On May 29, 2012, at 3:55 PM, Prasanth J wrote:
>>>>>>>
>>>>>>>> Thanks Jonathan for looking into it and for your suggestions.
>>>>>>>>
>>>>>>>> The reason why I came with a clause rather than a separate operator was
>>>>>>>> to avoid adding additional operators to the grammar.
>>>>>>>> So adding ROLLUP, GROUPING_SET will need separate logical operators
>>>>>>>> adding to the complexity. I am planning to keep everything under cube
>>>>>>>> operator, so only LOCube and POCube operators will be added additionally.
>>>>>>>> And as you and Dmitriy have mentioned the purpose of HAVING clause is the
>>>>>>>> same as FILTER so we do not need a separate HAVING clause.
>>>>>>>>
>>>>>>>> I will give a quick recap of cube related operations and multiple syntax
>>>>>>>> options for achieving the same.
>>>>>>>> I am also adding partial cubing and rollup
>>>>>>>> in this discussion.
>>>>>>>>
>>>>>>>> 1) CUBE
>>>>>>>>
>>>>>>>> Current syntax:
>>>>>>>> alias = CUBE rel BY (a, b);
>>>>>>>>
>>>>>>>> Following group-by's will be computed:
>>>>>>>> (a, b)
>>>>>>>> (a)
>>>>>>>> (b)
>>>>>>>> ()
>>>>>>>>
>>>>>>>> 2) Partial CUBE
>>>>>>>>
>>>>>>>> Proposed syntax:
>>>>>>>> alias = CUBE rel BY a, (b, c);
>>>>>>>>
>>>>>>>> Following group-by's will be computed:
>>>>>>>> (a, b, c)
>>>>>>>> (a, b)
>>>>>>>> (a, c)
>>>>>>>> (a)
>>>>>>>>
>>>>>>>> 3) ROLLUP
>>>>>>>>
>>>>>>>> Proposed syntax 1:
>>>>>>>> alias = CUBE rel BY ROLLUP(a, b);
>>>>>>>>
>>>>>>>> Proposed syntax 2:
>>>>>>>> alias = CUBE rel BY (a::b);
>>>>>>>>
>>>>>>>> Proposed syntax 3:
>>>>>>>> alias = ROLLUP rel BY (a, b);
>>>>>>>>
>>>>>>>> Following group-by's will be computed:
>>>>>>>> (a, b)
>>>>>>>> (a)
>>>>>>>> ()
>>>>>>>>
>>>>>>>> 4) Partial ROLLUP
>>>>>>>>
>>>>>>>> Proposed syntax 1:
>>>>>>>> alias = CUBE rel BY a, ROLLUP(b, c);
>>>>>>>>
>>>>>>>> Proposed syntax 2:
>>>>>>>> alias = CUBE rel BY (a, b::c);
>>>>>>>>
>>>>>>>> Proposed syntax 3:
>>>>>>>> alias = ROLLUP rel BY a, (b, c);
>>>>>>>>
>>>>>>>> Following group-by's will be computed:
>>>>>>>> (a, b, c)
>>>>>>>> (a, b)
>>>>>>>> (a)
>>>>>>>>
>>>>>>>> 5) GROUPING SETS
>>>>>>>>
>>>>>>>> Proposed syntax 1:
>>>>>>>> alias = CUBE rel BY GROUPING SETS((a), (b, c), (c));
>>>>>>>>
>>>>>>>> Proposed syntax 2:
>>>>>>>> alias = CUBE rel BY {(a), (b, c), (c)};
>>>>>>>>
>>>>>>>> Proposed syntax 3:
>>>>>>>> alias = GROUPING_SET rel BY ((a), (b, c), (c));
>>>>>>>>
>>>>>>>> Following group-by's will be computed:
>>>>>>>> (a)
>>>>>>>> (b, c)
>>>>>>>> (c)
>>>>>>>>
>>>>>>>> Please vote for syntax 1, 2 or 3 so that we can come to a consensus
>>>>>>>> before I start hacking the grammar file.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> -- Prasanth
>>>>>>>>
>>>>>>>> On May 29, 2012, at 4:05 PM, Jonathan Coveney wrote:
>>>>>>>>
>>>>>>>>> Hey Prashanth, happy hacking.
>>>>>>>>>
>>>>>>>>> My opinion:
>>>>>>>>>
>>>>>>>>> CUBE:
>>>>>>>>>
>>>>>>>>> alias = CUBE rel BY (a,b,c);
>>>>>>>>>
>>>>>>>>> I like that syntax. It's unambiguous what is going on.
>>>>>>>>>
>>>>>>>>> ROLLUP:
>>>>>>>>>
>>>>>>>>> alias = CUBE rel BY ROLLUP(a,b,c);
>>>>>>>>>
>>>>>>>>> I never liked that syntax in SQL. I suggest we just do what we did with
>>>>>>>>> CUBE. IE
>>>>>>>>>
>>>>>>>>> alias = ROLLUP rel BY (a,b,c);
>>>>>>>>>
>>>>>>>>> GROUPING SETS:
>>>>>>>>>
>>>>>>>>> alias = CUBE rel BY GROUPING SETS((a,b),(b),());
>>>>>>>>>
>>>>>>>>> I don't like this. The cube vs. grouping sets is confusing to me. Maybe
>>>>>>>>> following the same pattern you could do something like:
>>>>>>>>>
>>>>>>>>> alias = GROUPING_SET rel BY ((a,b),(b),());
>>>>>>>>>
>>>>>>>>> As far as having, is there an optimization that can be done with a HAVING
>>>>>>>>> clause that can't be done based on the logical plan that comes afterwards?
>>>>>>>>> That seems odd to me. Since you have to materialize the result anyway,
>>>>>>>>> can't the having clause just be a FILTER that comes after the cube? I don't
>>>>>>>>> know why we need a special syntax.
>>>>>>>>>
>>>>>>>>> My opinion. Forgive janky formatting, gmail + paste = pain.
>>>>>>>>> Jon
>>>>>>>>>
>>>>>>>>> 2012/5/27 Prasanth J <[email protected]>
>>>>>>>>>
>>>>>>>>>> Hello everyone
>>>>>>>>>>
>>>>>>>>>> I am looking for feedback from the community about the syntax for
>>>>>>>>>> CUBE/ROLLUP/GROUPING SETS operations in pig.
>>>>>>>>>> I am moving the discussion from JIRA to dev-list so that everyone can
>>>>>>>>>> share their opinion for operator syntax.
>>>>>>>>>> Please have a look at the syntax
>>>>>>>>>> proposal at the link below and let me know your opinion
>>>>>>>>>>
>>>>>>>>>> https://issues.apache.org/jira/browse/PIG-2167?focusedCommentId=13277644&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13277644
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> -- Prasanth
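
For reference, the group-by combinations listed in this thread for CUBE, ROLLUP and their partial variants can be enumerated with a short sketch. This is Python rather than Pig Latin, and it is only an illustration of the set arithmetic, not how Pig implements the operator. The function names (cube_sets, rollup_sets, combine) are hypothetical. The mixed case assumes SQL-style semantics, where the grouping sets of each clause are concatenated pairwise.

```python
from itertools import combinations, product

def cube_sets(dims):
    """All subsets of dims, largest first: CUBE(a, b) -> (a, b), (a), (b), ()."""
    return [c for r in range(len(dims), -1, -1) for c in combinations(dims, r)]

def rollup_sets(dims):
    """All prefixes of dims, longest first: ROLLUP(a, b) -> (a, b), (a), ()."""
    return [tuple(dims[:i]) for i in range(len(dims), -1, -1)]

def combine(*set_lists):
    """Cartesian product of grouping-set lists, concatenating each pick.

    Assumption: a dimension named in two clauses is kept twice, not deduplicated.
    """
    return [sum(pick, ()) for pick in product(*set_lists)]

# Partial variants keep a fixed leading dimension in every grouping set.
partial_cube = combine([("a",)], cube_sets(["b", "c"]))
partial_rollup = combine([("a",)], rollup_sets(["b", "c"]))

# CUBE(a,b,c), ROLLUP(c,d), CUBE(e,f) from the June 21 message.
mixed = combine(
    cube_sets(["a", "b", "c"]),
    rollup_sets(["c", "d"]),
    cube_sets(["e", "f"]),
)
print(len(mixed))  # 8 * 3 * 4 combinations
```

The count grows multiplicatively with each extra clause, which is the same concern raised above about the cost of a cartesian product.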
