I am in favor of this change. The information in one GROUPING_ID column is the 
same as the information in n GROUPING(i) columns, but it is easier to manage 
one column than several.

I screwed up when implementing 
https://issues.apache.org/jira/browse/CALCITE-370. It would have been a lot 
cheaper to fix this if we had realized during the review process that indicator 
columns would be difficult to handle.

It is important that we do not use a representation that limits the number of 
grouping columns. BIGINT contains 64 bits, and therefore allows 64 grouping 
columns, but that is not enough. We could use a representation that allows an 
unlimited number of bits, say BINARY(ceil(n / 8)). Or we could simply state 
that the client should not assume any particular data type: for 1 grouping 
column the type might be TINYINT, for 10,000 grouping columns the type might be 
BLOB. The GROUPING_ID function will convert the internal type to a number, and 
the GROUPING(i) functions will extract the i’th bit.

We will need to re-organize the output row-type of Aggregate. For an aggregate 
with 3 group keys [k0, k1, k2], grouping sets (i.e. indicator = true), and 2 
aggregate functions [agg1, agg2], the output is currently [k0, k1, k2, i0, k1, 
k2, agg1, agg2] where [i0, i1, i2] are the indicator columns. Should the new 
output be [k0, k1, k2, g, agg1, agg2] or should it be [g, k0, k1, k2, agg1, 
agg2]? (In both cases g is the grouping_id column, which concatenates the 
indicator columns into one value.) The latter is more different from the 
current output format, but since the grouping_id column appears first it will 
“fail fast”. We will discover more quickly that a particular rule is not 
handling grouping_id properly.

Julian



> On Dec 8, 2015, at 5:50 AM, Jesus Camacho Rodriguez 
> <[email protected]> wrote:
> 
> Hi,
> 
> I wanted to open a thread with some conversation about changes in Grouping 
> Sets implementation in Calcite.
> 
> Grouping sets are currently implemented in Calcite using a bit to indicate 
> each of the grouping columns. For instance, consider the following group by 
> clause:
> 
> ...
> GROUP BY CUBE (a, b)
> ...
> 
> The generated Aggregate operator in Calcite will have a row schema consisting 
> of [a, b, GROUPING(a), GROUPING(b)], where GROUPING(x) represents whether x 
> is participating in the group by clause. In particular, GROUPING returns 1 
> for aggregated or 0 for not aggregated in the result set.
> 
> In contrast, Hive's implementation stores a single number corresponding to 
> the GROUPING bit vector associated with a row (this is the result of the 
> GROUPING_ID function in RDBMS such as MSSQLServer, Oracle, etc). Thus, the 
> row schema of the Aggregate operator is [a, b, GROUPING_ID(a,b)].
> 
> This difference is creating a mismatch between Calcite and Hive. Till now, we 
> have solved it in the Hive side: we created our own GROUPING_ID function 
> applied over those columns. However, we have some issues related to 
> predicates pushdown, constant propagation, etc., that we need to continue 
> solving as e.g. new rules are added to our optimizer. In short, this is 
> making the code on the Hive side harder and harder to maintain.
> 
> We were wondering if it would make sense to change the implementation on the 
> Calcite side. Which systems are using Calcite's grouping sets currently? Are 
> you in favor/against this change? Why?
> 
> Thanks,
> Jesús
> 

Reply via email to