Changes for Grouping Sets in Calcite

Jesus Camacho Rodriguez Tue, 08 Dec 2015 05:51:08 -0800

Hi,

I wanted to open a thread with some conversation about changes in Grouping Sets 
implementation in Calcite.


Grouping sets are currently implemented in Calcite using a bit to indicate each 
of the grouping columns. For instance, consider the following group by clause:

...
GROUP BY CUBE (a, b)
...

The generated Aggregate operator in Calcite will have a row schema consisting 
of [a, b, GROUPING(a), GROUPING(b)], where GROUPING(x) represents whether x is 
participating in the group by clause. In particular, GROUPING returns 1 for 
aggregated or 0 for not aggregated in the result set.

In contrast, Hive's implementation stores a single number corresponding to the 
GROUPING bit vector associated with a row (this is the result of the 
GROUPING_ID function in RDBMS such as MSSQLServer, Oracle, etc). Thus, the row 
schema of the Aggregate operator is [a, b, GROUPING_ID(a,b)].

This difference is creating a mismatch between Calcite and Hive. Till now, we 
have solved it in the Hive side: we created our own GROUPING_ID function 
applied over those columns. However, we have some issues related to predicates 
pushdown, constant propagation, etc., that we need to continue solving as e.g. 
new rules are added to our optimizer. In short, this is making the code on the 
Hive side harder and harder to maintain.

We were wondering if it would make sense to change the implementation on the 
Calcite side. Which systems are using Calcite's grouping sets currently? Are 
you in favor/against this change? Why?

Thanks,
Jesús

Changes for Grouping Sets in Calcite

Reply via email to