paul-rogers commented on PR #13793:
URL: https://github.com/apache/druid/pull/13793#issuecomment-1429032881

   This PR is of interest because it impacts the same area as the catalog 
project. Just FYI, here is some background.
   
   With the catalog, we want to validate the type of each column. Quick: what 
is the type of column `X` here:
   
   ```sql
   INSERT INTO dst
   SELECT LATEST(A) AS X FROM TABLE(...) (A VARCHAR)
   ```
   
   As explained above, if we just look at the `SELECT`, then `X` is `VARCHAR`. 
(That is, the aggregation is finalized.) But we want to insert into `dst`. In 
the catalog schema, what type should we declare `X` to be? It is certainly 
_not_ `VARCHAR` because, as noted above, a finalized value cannot be further 
aggregated. So, what is it?
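To see why the finalized `VARCHAR` is not enough: `LATEST` over a string keeps a (timestamp, value) pair as its intermediate value, and combining two intermediates picks the one with the later timestamp. Once finalized, the timestamp is gone, so the value cannot be combined again. A minimal sketch of that semantics (Python used purely for illustration; the names here are hypothetical, not Druid code):

```python
from typing import NamedTuple

# Hypothetical model of Druid's serializablePairLongString: the
# intermediate form of LATEST/EARLIEST over string columns.
class PairLongString(NamedTuple):
    timestamp: int   # event time at which the value was observed
    value: str       # the string value at that time

def combine_latest(a: PairLongString, b: PairLongString) -> PairLongString:
    # LATEST combines by keeping the pair with the greater timestamp.
    return a if a.timestamp >= b.timestamp else b

def finalize(pair: PairLongString) -> str:
    # Finalization drops the timestamp, yielding the SQL-visible VARCHAR.
    return pair.value

seg1 = PairLongString(timestamp=100, value="early")
seg2 = PairLongString(timestamp=200, value="late")

merged = combine_latest(seg1, seg2)
print(finalize(merged))  # "late"
# A finalized "late" alone cannot participate in combine_latest:
# the timestamp it needs has been discarded.
```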
   
   The `StringLastAggregatorFactory` tells us that the intermediate type is 
`COMPLEX<serializablePairLongString>`. If we knew that in the SQL layer, we 
could reject the following. Notice that the type of `A` has changed to `BIGINT`:
   
   ```sql
   INSERT INTO dst
   SELECT LATEST(A) AS X FROM TABLE(...) (A BIGINT)
   ```
   
   Unfortunately, the `COMPLEX<serializablePairLongString>` type is a bit too 
general, since `EARLIEST` produces the same type. Still, we would like to 
reject the following in the case where `X` was declared with `LATEST`, to 
prevent the catastrophic combining of the two aggregators on the same column in 
different ingestions:
   
   ```sql
   INSERT INTO dst
   SELECT EARLIEST(A) AS X FROM TABLE(...) (A VARCHAR)
   ```
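If the intermediate type carried the aggregator name, the catalog could reject such a mismatch at planning time. A hypothetical sketch of that check (the type strings are the ones proposed here; the table, mapping, and function are illustrative, not Druid code):

```python
# Hypothetical catalog: each rollup column records the
# aggregator-specific intermediate type it was declared with.
catalog = {
    # table -> column -> declared intermediate type
    "dst": {"X": "COMPLEX<latest-string>"},
}

# What each SQL aggregator would report as its intermediate type.
INTERMEDIATE_TYPE = {
    ("LATEST", "VARCHAR"): "COMPLEX<latest-string>",
    ("EARLIEST", "VARCHAR"): "COMPLEX<earliest-string>",
    ("LATEST", "BIGINT"): "COMPLEX<latest-long>",
}

def validate_insert(table: str, column: str, agg: str, arg_type: str) -> None:
    # Reject the INSERT if the aggregation's intermediate type does not
    # match the type declared for the target column.
    declared = catalog[table][column]
    actual = INTERMEDIATE_TYPE[(agg, arg_type)]
    if actual != declared:
        raise TypeError(
            f"{agg}({arg_type}) produces {actual}, "
            f"but {table}.{column} is declared {declared}")

validate_insert("dst", "X", "LATEST", "VARCHAR")     # OK
# Both failure cases from the examples above would be caught:
#   validate_insert("dst", "X", "LATEST", "BIGINT")     -> TypeError
#   validate_insert("dst", "X", "EARLIEST", "VARCHAR")  -> TypeError
```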
   
   So, we need an internal type of something like `COMPLEX<latest-string>`. 
Unfortunately, there is no such type in Druid: the information about which 
aggregate function produced the value is lost. So we need some way to preserve 
that information.
   
   The challenge for this PR is that the SQL layer knows nothing about the 
internal types (that I can find). Our SQL aggregators report only the finalized 
type (`VARCHAR` in this example). As a result, you're trying to fix a bug 
without really having the internal structure needed. Let's hope a simple fix 
can be found that solves this one PR. The actual issue is that columns should 
be referenced by their output names (per SQL), not their input names.
   
   Or, perhaps, we have to resolve the larger issue. For MSQ, we have to know 
whether the `LATEST(VARCHAR)` function returns:
   
   * `VARCHAR`, if MSQ is used to query, or if the user is doing the "aggregate 
in the `SELECT`, store the result as detail" case.
   * `COMPLEX<latest-string>`, if MSQ is used to ingest, and the `SELECT` gives 
the aggregation to use for both ingestion and compaction. The SQL type 
`COMPLEX<latest-string>` maps to the Druid type 
`COMPLEX<serializablePairLongString>` at the storage level.
   
   The Druid Calcite-based planner assumes the first case (the type of 
`LATEST(VARCHAR)` is `VARCHAR`). But, to make `LATEST` work in compaction, we 
need the second form of the type, which we don't have at present.
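The two cases above could be modeled as context-dependent return-type resolution: the same SQL aggregator reports its finalized type when results go to the user, and its intermediate type when MSQ is ingesting into a rollup table. A sketch of that idea (illustrative only; Druid's planner does not currently expose this distinction, and the names are hypothetical):

```python
# Hypothetical context-dependent type resolution for LATEST(VARCHAR).
def latest_varchar_sql_type(msq_ingest: bool) -> str:
    # Querying, or "aggregate in the SELECT, store as detail":
    # report the finalized type.
    if not msq_ingest:
        return "VARCHAR"
    # MSQ ingestion into a rollup table: report the intermediate type,
    # which also tells compaction which aggregator to reapply.
    return "COMPLEX<latest-string>"

# The aggregator-specific SQL types would share one Druid storage type.
STORAGE_TYPE = {
    "COMPLEX<latest-string>": "COMPLEX<serializablePairLongString>",
    "COMPLEX<earliest-string>": "COMPLEX<serializablePairLongString>",
}

print(latest_varchar_sql_type(msq_ingest=False))  # VARCHAR
print(latest_varchar_sql_type(msq_ingest=True))   # COMPLEX<latest-string>
```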
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

