paul-rogers commented on PR #13793:
URL: https://github.com/apache/druid/pull/13793#issuecomment-1427132705
This is an interesting issue as it touches on a fundamental ambiguity in
MSQ. MSQ uses a `SELECT` to perform aggregation. There are two ways to
interpret such aggregations:
* Do the aggregations within the `SELECT` and insert the resulting rows as
detail rows into a detail datasource.
* Do the aggregations not just within the `SELECT` but also during
compaction. Insert aggregated rows into a rollup datasource.
The first interpretation is standard SQL in which each `SELECT` is a black
box which simply returns rows. The second interpretation is required for
compaction to work. In this case, if we write the finalized `EARLIEST` into the
datasource, compaction cannot then take two segments with the same time period
and combine them to find the "earliest EARLIEST".
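To make that concrete, here is a small Python sketch (illustrative only, not Druid's actual aggregator code; the function and variable names are made up) of why the intermediate `(timestamp, value)` pair is mergeable across segments while a finalized value is not:

```python
# Sketch: an EARLIEST-style aggregate kept in intermediate form as a
# (timestamp, value) pair remains mergeable; a finalized bare value does not.

def earliest_intermediate(pairs):
    """Return the (timestamp, value) pair with the smallest timestamp."""
    return min(pairs, key=lambda pair: pair[0])

# Two segments covering the same time chunk, each already rolled up
# to an intermediate EARLIEST pair:
segment_a = earliest_intermediate([(10, "x"), (30, "y")])   # (10, "x")
segment_b = earliest_intermediate([(5, "z"), (20, "w")])    # (5, "z")

# Compaction can combine the intermediates to find the "earliest EARLIEST":
combined = earliest_intermediate([segment_a, segment_b])    # (5, "z")

# But if each segment stored only the finalized values "x" and "z",
# the timestamps are gone and there is no basis to choose between them.
finalized_a, finalized_b = segment_a[1], segment_b[1]
```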
As a result, we want MSQ semantics to differ from those of SQL: the contents
of the `SELECT` statements "leak out" and are written to segments so that
compaction can pick up where MSQ left off and continue aggregating. This is
non-standard SQL, and causes us to rethink many parts of SQL handling (and to
work around the fact that Calcite assumes SQL semantics).
Now to this specific case. Suppose I write my query as `LATEST_BY(foo,
timestamp)` where `timestamp` is a column in the external table. This presents
a variety of issues.
* The `timestamp` column is probably not a `TIMESTAMP`. So, I must write
`LATEST_BY(foo, TIME_PARSE(timestamp))`. This solves the first case above: the
SQL is a black box.
* SQL semantics say that I can use SELECT columns in an aggregate. So, it
should be perfectly legal to say `LATEST_BY(foo, __time)` where `__time` has
been defined as `TIME_PARSE(timestamp) AS __time`.
* If we go with the `LATEST_BY(foo, TIME_PARSE(timestamp))`, then we cannot
do the rollup use case: the `timestamp` column does not exist in the generated
segments, and so compaction cannot further compact the columns.
* A workaround might be to write `timestamp` into the segment so it is
available for compaction. However, this will be confusing for the user: why
must they save the same column twice, so that a column name from the _input_ is
available in the _output_?
Or, perhaps the goal is that `LATEST_BY` cannot be further aggregated
(because the input column is no longer available). If that is the goal, one
could argue that the result is a bug. If we force users to use `LATEST_BY`
because `LATEST` is broken, but `LATEST_BY` is also broken, we've not actually
made much progress.
So, what should we do? We should ensure that aggregators refer to columns
from the `SELECT` clause, not the input. That is:
```sql
INSERT INTO dst
SELECT TIME_PARSE(timestamp) as __time, SUM(CAST(value AS BIGINT)) AS mySum
FROM TABLE(EXTERN(...))
(timestamp VARCHAR, value VARCHAR)
```
This sums the `value` column after parsing. The change here suggests I should
have written:
```sql
INSERT INTO dst
SELECT TIME_PARSE(timestamp) as __time, SUM(value) AS mySum
FROM TABLE(EXTERN(...))
(timestamp VARCHAR, value VARCHAR)
```
That is, I have to `SUM()` the input column, but that input column is a
`VARCHAR` and is not valid to sum.
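As a rough analogy (plain Python, not Druid code), summing the raw `VARCHAR` column fails on types, while summing the cast values works:

```python
# The external `value` column arrives as VARCHAR (strings here):
raw_values = ["1", "2", "3"]

# SUM(value) on the raw VARCHAR column is a type error:
try:
    sum(raw_values)
except TypeError:
    summed_raw = None          # strings cannot be summed

# SUM over CAST(value AS BIGINT) works:
parsed = [int(v) for v in raw_values]
total = sum(parsed)            # 6
```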
Let's apply this to the case at hand. What we want is to refer to the output
columns:
```sql
INSERT INTO dst
SELECT
TIME_PARSE(timestamp) as __time,
LATEST_BY(CAST(value AS BIGINT), __time) AS latestValue
FROM TABLE(EXTERN(...))
(timestamp VARCHAR, value VARCHAR)
```
Or, better, because `__time` is, in fact, an output column, why not just use
`LATEST`:
```sql
INSERT INTO dst
SELECT
TIME_PARSE(timestamp) as __time,
LATEST(CAST(value AS BIGINT)) AS latestValue
FROM TABLE(EXTERN(...))
(timestamp VARCHAR, value VARCHAR)
```
If we go with the rule here, then I would have to write:
```sql
INSERT INTO dst
SELECT
TIME_PARSE(timestamp) as __time,
LATEST_BY(value, timestamp) AS latestValue
FROM TABLE(EXTERN(...))
(timestamp VARCHAR, value VARCHAR)
```
The proposed solution not only goes against SQL semantics, it cannot even be
made correct as written: the types of both `value` and `timestamp` are wrong.
To fix that, I suppose I could use a nested select:
```sql
INSERT INTO dst
SELECT
__time,
LATEST_BY(numValue, __time) AS latestValue
FROM (
SELECT
TIME_PARSE(timestamp) as __time,
CAST(value AS DOUBLE) as numValue
FROM TABLE(EXTERN(...))
(timestamp VARCHAR, value VARCHAR)
)
```
Still, however, to follow SQL, the `__time` in `LATEST_BY` would be the one
from the outer `SELECT`. But MSQ needs it to be the one from the inner
`SELECT`. We will find we must rewrite Calcite (and rethink hundreds of pages
of SQL semantics) to make that work.
In short, this bug could use a bit more design thinking before we propose a
code change.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]