Thanks for your quick reply Julian. Although I do not feel too strongly about one choice or another, I think we could use a single type, as it will simplify the GROUPING_ID and GROUPING functions implementation. However, I agree with you that we cannot limit the number of grouping columns. Thus, using BINARY as you suggested seems a logical choice.
Wrt Aggregate output, I was thinking on implementing it as [k0, k1, k2, g, agg1, agg2] : the only reason is that Hive implements it that way. However, I would like to understand why you think [g, k0, k1, k2, agg1, agg2] could be a better choice, I did not understand it fully. What do you mean with "fail fast"? And about discovering rules that do not handle GROUPING_ID properly? -- Jesús On 12/8/15, 8:16 PM, "Julian Hyde" <[email protected]> wrote: >I am in favor of this change. The information in one GROUPING_ID column is the >same as the information in n GROUPING(i) columns, but it is easier to manage >one column than several. > >I screwed up when implementing >https://issues.apache.org/jira/browse/CALCITE-370. It would have been a lot >cheaper to fix this if we had realized during the review process that >indicator columns would be difficult to handle. > >It is important that we do not use a representation that limits the number of >grouping columns. BIGINT contains 64 bits, and therefore allows 64 grouping >columns, but that is not enough. We could use a representation that allows an >unlimited number of bits, say BINARY(ceil(n / 8)). Or we could simply state >that the client should not assume any particular data type: for 1 grouping >column the type might be TINYINT, for 10,000 grouping columns the type might >be BLOB. The GROUPING_ID function will convert the internal type to a number, >and the GROUPING(i) functions will extract the i’th bit. > >We will need to re-organize the output row-type of Aggregate. For an aggregate >with 3 group keys [k0, k1, k2], grouping sets (i.e. indicator = true), and 2 >aggregate functions [agg1, agg2], the output is currently [k0, k1, k2, i0, k1, >k2, agg1, agg2] where [i0, i1, i2] are the indicator columns. Should the new >output be [k0, k1, k2, g, agg1, agg2] or should it be [g, k0, k1, k2, agg1, >agg2]? (In both cases g is the grouping_id column, which concatenates the >indicator columns into one value.) The latter is more different from the >current output format, but since the grouping_id column appears first it will >“fail fast”. We will discover more quickly that a particular rule is not >handling grouping_id properly. > >Julian > > > >> On Dec 8, 2015, at 5:50 AM, Jesus Camacho Rodriguez >> <[email protected]> wrote: >> >> Hi, >> >> I wanted to open a thread with some conversation about changes in Grouping >> Sets implementation in Calcite. >> >> Grouping sets are currently implemented in Calcite using a bit to indicate >> each of the grouping columns. For instance, consider the following group by >> clause: >> >> ... >> GROUP BY CUBE (a, b) >> ... >> >> The generated Aggregate operator in Calcite will have a row schema >> consisting of [a, b, GROUPING(a), GROUPING(b)], where GROUPING(x) represents >> whether x is participating in the group by clause. In particular, GROUPING >> returns 1 for aggregated or 0 for not aggregated in the result set. >> >> In contrast, Hive's implementation stores a single number corresponding to >> the GROUPING bit vector associated with a row (this is the result of the >> GROUPING_ID function in RDBMS such as MSSQLServer, Oracle, etc). Thus, the >> row schema of the Aggregate operator is [a, b, GROUPING_ID(a,b)]. >> >> This difference is creating a mismatch between Calcite and Hive. Till now, >> we have solved it in the Hive side: we created our own GROUPING_ID function >> applied over those columns. However, we have some issues related to >> predicates pushdown, constant propagation, etc., that we need to continue >> solving as e.g. new rules are added to our optimizer. In short, this is >> making the code on the Hive side harder and harder to maintain. >> >> We were wondering if it would make sense to change the implementation on the >> Calcite side. Which systems are using Calcite's grouping sets currently? Are >> you in favor/against this change? Why? >> >> Thanks, >> Jesús >> > >
