Hi, I wanted to open a thread with some conversation about changes in Grouping Sets implementation in Calcite.
Grouping sets are currently implemented in Calcite using a bit to indicate each of the grouping columns. For instance, consider the following group by clause: ... GROUP BY CUBE (a, b) ... The generated Aggregate operator in Calcite will have a row schema consisting of [a, b, GROUPING(a), GROUPING(b)], where GROUPING(x) represents whether x is participating in the group by clause. In particular, GROUPING returns 1 for aggregated or 0 for not aggregated in the result set. In contrast, Hive's implementation stores a single number corresponding to the GROUPING bit vector associated with a row (this is the result of the GROUPING_ID function in RDBMS such as MSSQLServer, Oracle, etc). Thus, the row schema of the Aggregate operator is [a, b, GROUPING_ID(a,b)]. This difference is creating a mismatch between Calcite and Hive. Till now, we have solved it in the Hive side: we created our own GROUPING_ID function applied over those columns. However, we have some issues related to predicates pushdown, constant propagation, etc., that we need to continue solving as e.g. new rules are added to our optimizer. In short, this is making the code on the Hive side harder and harder to maintain. We were wondering if it would make sense to change the implementation on the Calcite side. Which systems are using Calcite's grouping sets currently? Are you in favor/against this change? Why? Thanks, Jesús
