paul-rogers commented on PR #13793: URL: https://github.com/apache/druid/pull/13793#issuecomment-1429032881
This PR is of interest because it impacts the same area as the catalog project. Just FYI, here is some background. With the catalog, we want to validate the type of each column. Quick, what is the type of column X here: ```sql INSERT INTO dst SELECT LATEST(A) AS X FROM TABLE(...) (A VARCHAR) ``` As explained above, if we just look at the `SELECT`, then `X` is `VARCHAR`. (That is, the aggregation is finalized.) But, we want to insert into `dst`. In the catalog schema, what type should we declare `X` to be? It is certainly _not_ `VARCHAR` because, as noted above, we could not further aggregate. So, what is it? The `StringLastAggregatorFactory` tells us that the intermediate type is `COMPLEX<serializablePairLongString>`. If we knew that in the SQL layer, we could reject the following. Notice the change of type of `A` to `BIGINT`: ```sql INSERT INTO dst SELECT LATEST(A) AS X FROM TABLE(...) (A BIGINT) ``` Unfortunately, the COMPLEX<serializablePairLongString>` type is a bit too general, since that is the same type for `EARLIEST`. However, we would like to reject the following in the case where `X` wants to be `LATEST` to prevent the catastrophic combining of the two aggregators on the same column in different ingestions: ```sql INSERT INTO dst SELECT EARLIEST(A) AS X FROM TABLE(...) (A VARCHAR) ``` So, we need an internal type of something like `COMPLEX<latest-string>`. Unfortunately, there is no such type in Druid: the information about the aggregate function is lost. So we need to do something or other to fix this. The challenge for this PR is that SQL knows nothing about the internal types (that I can find). Our SQL aggregators report only the finalized type (`VARCHAR` in this example.) As a result, you're trying to fix a bug without really having the internal structure needed. Let's hope a simple fix can be found that solves this one PR. The actual issue is to reference columns in the output (per SQL) and not the input. Or, perhaps, we have to resolve the larger issue. For MSQ, we have to know if the `LATEST(VARCHAR)` function returns: * `VARCHAR` if MSQ is used to query, or if the user is doing the "aggregate in the `SELECT`, store the result as detail" case. * `COMPLEX<latest-string>` if MSQ is used to ingest, and the `SELECT` is giving the aggregation to use both on ingestion and compaction. The SQL type COMPLEX<latest-string>` maps to the Druid type `COMPLEX<serializablePairLongString>` at the storage level. The Druid Calcite-based planner assumes the first case. (The type of `LATEST(VARCHAR)` is `VARCHAR`). But, to make `LATEST` work in compaction, we need the second form of the type, which we don't have at present. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
