paul-rogers commented on PR #13793:
URL: https://github.com/apache/druid/pull/13793#issuecomment-1427132705
This is an interesting issue as it touches on a fundamental ambiguity in
MSQ. MSQ uses a `SELECT` to perform aggregation. There are two ways to
interpret such aggregations:
* Do the aggregations within the `SELECT` and insert the resulting rows as
detail rows into a detail datasource.
* Do the aggregations not just within the `SELECT` but also during
compaction. Insert aggregated rows into a rollup datasource.
The first interpretation is standard SQL in which each `SELECT` is a black
box which simply returns rows. The second interpretation is required for
compaction to work. In this case, if we write the finalized `EARLIEST` into the
datasource, compaction cannot then take two segments with the same time period
and combine them to find the "earliest EARLIEST".
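To make that concrete, here is a small Python sketch (illustrative only, not Druid's actual aggregator code; the function and variable names are made up) of why the intermediate `(timestamp, value)` pair is mergeable across segments while a finalized value is not:

```python
# Sketch: an EARLIEST-style aggregate kept in intermediate form as a
# (timestamp, value) pair remains mergeable; a finalized bare value does not.

def earliest_intermediate(pairs):
    """Return the (timestamp, value) pair with the smallest timestamp."""
    return min(pairs, key=lambda pair: pair[0])

# Two segments covering the same time chunk, each already rolled up
# to an intermediate EARLIEST pair:
segment_a = earliest_intermediate([(10, "x"), (30, "y")])   # (10, "x")
segment_b = earliest_intermediate([(5, "z"), (20, "w")])    # (5, "z")

# Compaction can combine the intermediates to find the "earliest EARLIEST":
combined = earliest_intermediate([segment_a, segment_b])    # (5, "z")

# But if each segment stored only the finalized values "x" and "z",
# the timestamps are gone and there is no basis to choose between them.
finalized_a, finalized_b = segment_a[1], segment_b[1]
```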
As a result, we want MSQ semantics to differ from those of SQL: the contents
of the `SELECT` statements "leak out" and are written to segments so that
compaction can pick up where MSQ left off and continue aggregating. This is
non-standard SQL, and causes us to rethink many parts of SQL handling (and to
work around the fact that Calcite assumes SQL semantics).
Now to this specific case. Suppose I write my query as `LATEST_BY(foo,
timestamp)` where `timestamp` is a column in the external table. This presents
a variety of issues.
* The `timestamp` column is probably not a `TIMESTAMP`. So, I must write
`LATEST_BY(foo, TIME_PARSE(timestamp))`. This solves the first case above: the
SQL is a black box.
* SQL semantics say that I can use SELECT columns in an aggregate. So, it
should be perfectly legal to say `LATEST_BY(foo, __time)` where `__time` has
been defined as `TIME_PARSE(timestamp) AS __time`.
* If we go with the `LATEST_BY(foo, TIME_PARSE(timestamp))`, then we cannot
do the rollup use case: the `timestamp` column does not exist in the generated
segments, and so compaction cannot further compact the columns.
* A workaround might be to write `timestamp` into the segment so it is
available for compaction. However, this will be confusing for the user: why
must they save the same column twice, so that a column name from the _input_ is
available in the _output_?
Or, perhaps the goal is that `LATEST_BY` cannot be further aggregated
(because the input column is no longer available). If that is the goal, one
could argue that the result is a bug. If we force users to use `LATEST_BY`
because `LATEST` is broken, but `LATEST_BY` is also broken, we've not actually
made much progress.
So, what should we do? We should ensure that aggregators refer to columns
from the `SELECT` clause, not the input. That is:
```sql
INSERT INTO dst
SELECT TIME_PARSE(timestamp) as __time, SUM(CAST(value AS BIGINT)) AS mySum
FROM TABLE(EXTERN(...))
(timestamp VARCHAR, value VARCHAR)
```
This sums the `value` column after parsing. The change here suggests I should
have written:
```sql
INSERT INTO dst
SELECT TIME_PARSE(timestamp) as __time, SUM(value) AS mySum
FROM TABLE(EXTERN(...))
(timestamp VARCHAR, value VARCHAR)
```
That is, I have to `SUM()` the input column, but that input column is a
`VARCHAR` and is not valid to sum.
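As a rough analogy (plain Python, not Druid code), summing the raw `VARCHAR` column fails on types, while summing the cast values works:

```python
# The external `value` column arrives as VARCHAR (strings here):
raw_values = ["1", "2", "3"]

# SUM(value) on the raw VARCHAR column is a type error:
try:
    sum(raw_values)
except TypeError:
    summed_raw = None          # strings cannot be summed

# SUM over CAST(value AS BIGINT) works:
parsed = [int(v) for v in raw_values]
total = sum(parsed)            # 6
```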
Let's apply this to the case at hand. What we want is to refer to the output
columns:
```sql
INSERT INTO dst
SELECT
TIME_PARSE(timestamp) as __time,
LATEST_BY(CAST(value AS BIGINT), __time) AS latestValue
FROM TABLE(EXTERN(...))
(timestamp VARCHAR, value VARCHAR)
```
Or, better, because `__time` is, in fact, an output column, why not just use
`LATEST`:
```sql
INSERT INTO dst
SELECT
TIME_PARSE(timestamp) as __time,
LATEST(CAST(value AS BIGINT)) AS latestValue
FROM TABLE(EXTERN(...))
(timestamp VARCHAR, value VARCHAR)
```
If we go with the rule here, then I would have to write:
```sql
INSERT INTO dst
SELECT
TIME_PARSE(timestamp) as __time,
LATEST_BY(value, timestamp) AS latestValue
FROM TABLE(EXTERN(...))
(timestamp VARCHAR, value VARCHAR)
```
The proposed solution not only goes against SQL semantics, it cannot even be
made correct as written: the types of both `value` and `timestamp` are wrong.
To fix that, I suppose I could use a nested select:
```sql
INSERT INTO dst
SELECT
__time,
LATEST_BY(numValue, __time) AS latestValue
FROM (
SELECT
TIME_PARSE(timestamp) as __time,
CAST(value AS DOUBLE) as numValue
FROM TABLE(EXTERN(...))
(timestamp VARCHAR, value VARCHAR)
)
```
Still, however, to follow SQL, the `__time` in `LATEST_BY` would be the one
from the outer `SELECT`. But MSQ needs it to be the one from the inner
`SELECT`. We will find we must rewrite Calcite (and rethink hundreds of pages
of SQL semantics) to make that work.
In short, this bug could use a bit more design thinking before we propose a
code change.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]