I am in favor of this change. The information in one GROUPING_ID column is the same as the information in n GROUPING(i) columns, but it is easier to manage one column than several.
I screwed up when implementing https://issues.apache.org/jira/browse/CALCITE-370. It would have been a lot cheaper to fix this if we had realized during the review process that indicator columns would be difficult to handle. It is important that we do not use a representation that limits the number of grouping columns. BIGINT contains 64 bits, and therefore allows 64 grouping columns, but that is not enough. We could use a representation that allows an unlimited number of bits, say BINARY(ceil(n / 8)). Or we could simply state that the client should not assume any particular data type: for 1 grouping column the type might be TINYINT, for 10,000 grouping columns the type might be BLOB. The GROUPING_ID function will convert the internal type to a number, and the GROUPING(i) functions will extract the i’th bit. We will need to re-organize the output row-type of Aggregate. For an aggregate with 3 group keys [k0, k1, k2], grouping sets (i.e. indicator = true), and 2 aggregate functions [agg1, agg2], the output is currently [k0, k1, k2, i0, k1, k2, agg1, agg2] where [i0, i1, i2] are the indicator columns. Should the new output be [k0, k1, k2, g, agg1, agg2] or should it be [g, k0, k1, k2, agg1, agg2]? (In both cases g is the grouping_id column, which concatenates the indicator columns into one value.) The latter is more different from the current output format, but since the grouping_id column appears first it will “fail fast”. We will discover more quickly that a particular rule is not handling grouping_id properly. Julian > On Dec 8, 2015, at 5:50 AM, Jesus Camacho Rodriguez > <[email protected]> wrote: > > Hi, > > I wanted to open a thread with some conversation about changes in Grouping > Sets implementation in Calcite. > > Grouping sets are currently implemented in Calcite using a bit to indicate > each of the grouping columns. For instance, consider the following group by > clause: > > ... > GROUP BY CUBE (a, b) > ... > > The generated Aggregate operator in Calcite will have a row schema consisting > of [a, b, GROUPING(a), GROUPING(b)], where GROUPING(x) represents whether x > is participating in the group by clause. In particular, GROUPING returns 1 > for aggregated or 0 for not aggregated in the result set. > > In contrast, Hive's implementation stores a single number corresponding to > the GROUPING bit vector associated with a row (this is the result of the > GROUPING_ID function in RDBMS such as MSSQLServer, Oracle, etc). Thus, the > row schema of the Aggregate operator is [a, b, GROUPING_ID(a,b)]. > > This difference is creating a mismatch between Calcite and Hive. Till now, we > have solved it in the Hive side: we created our own GROUPING_ID function > applied over those columns. However, we have some issues related to > predicates pushdown, constant propagation, etc., that we need to continue > solving as e.g. new rules are added to our optimizer. In short, this is > making the code on the Hive side harder and harder to maintain. > > We were wondering if it would make sense to change the implementation on the > Calcite side. Which systems are using Calcite's grouping sets currently? Are > you in favor/against this change? Why? > > Thanks, > Jesús >
