Dear calcite devs,
I have been following the project for a long time and love how Calcite made it
possible to use SQL everywhere (I have built several SQL interfaces on top of
specific file formats myself). I also like the strong support for streaming SQL.
The reason I'm writing this email is not only to give the project some love but
also because we are thinking about a SQL "extension" which, I think, is not
specific to our use case but could serve others in different scenarios as well.
In detail, we are working with streams of data from devices (think of Industry
4.0). We read data, e.g., from PLCs (using the incubating Apache PLC4X project,
where I contribute) and do analytics on it. The analyses done there are pretty
similar to those for traces from tests, e.g., automotive test drives or related
industries. What all these streams have in common is:
* they are usually ordered by time
* elements of different groups of signals ("rows" from "tables") arrive ordered
by time but not with equal timestamps, e.g., one group arrives each second,
other quantities much more frequently
* the "natural" join for these signal groups ("tables") is some kind of
interpolation (sample and hold, linear interpolation, splines, ...) with
respect to (event-)time
* in some cases signal types are not known and can only be guessed based on the
first value, e.g., on CAN there is no strict notion of "double" or "integer"
channels; rather there are integer base values + a conversion formula (like
a*x + b) + possible lookup tables for "other" values (SNA, NULL, DISABLED, ...)
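To make the last point concrete, a decoder for such a CAN-style channel could
look roughly like this (class name and factor values are invented for the
sketch; a real implementation would read them from a CAN database file):

```java
import java.util.Map;

// Hypothetical decoder for a CAN-style channel: the wire carries an
// integer base value; a linear conversion (a*x + b) yields the physical
// value, and a lookup table maps special raw values to symbolic states
// such as "SNA" (signal not available).
public class CanChannelDecoder {
    private final double a;                 // conversion factor
    private final double b;                 // conversion offset
    private final Map<Long, String> lookup; // special raw values

    public CanChannelDecoder(double a, double b, Map<Long, String> lookup) {
        this.a = a;
        this.b = b;
        this.lookup = lookup;
    }

    /** Returns either a symbolic String or a Double physical value. */
    public Object decode(long raw) {
        String special = lookup.get(raw);
        if (special != null) {
            return special;
        }
        return a * raw + b;
    }

    public static void main(String[] args) {
        // e.g., a temperature channel: raw * 0.5 - 40, with 0xFF meaning SNA
        CanChannelDecoder temp =
            new CanChannelDecoder(0.5, -40.0, Map.of(0xFFL, "SNA"));
        System.out.println(temp.decode(120));  // prints 20.0
        System.out.println(temp.decode(0xFF)); // prints SNA
    }
}
```

Note that the decoded column type is only known once the first non-special
value arrives, which is exactly the type-guessing problem described above.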
On the other hand, the analyses we would like to perform often revolve around
timestamps:
* get timestamps where a condition becomes true
* boolean value toggled
* numeric value is above / below threshold
* signal change rate is above / below threshold
* ...
* get the values of certain signals at the point in time when a condition
becomes true (see above)
* get windows based on conditions
* while signal is true
* while value above ...
* ...
* Do aggregations on signals in the mentioned windows
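Taken together, the "becomes true" and window cases above could be prototyped
roughly like this (all class and record names are invented for the sketch; a
real implementation would operate on Calcite rows, not an in-memory list):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of "trigger on become true" and "aggregate WHILE condition"
// semantics over a time-ordered in-memory stream of samples.
public class EdgeDetector {
    record Sample(long timestamp, boolean condition, double value) {}
    record Window(long start, long end, double max) {}

    /** Timestamps at which the condition toggles from false to true. */
    static List<Long> onBecomeTrue(List<Sample> stream) {
        List<Long> edges = new ArrayList<>();
        boolean previous = false;
        for (Sample s : stream) {
            if (s.condition() && !previous) {
                edges.add(s.timestamp());
            }
            previous = s.condition();
        }
        return edges;
    }

    /** MAX(value) per contiguous window in which the condition is true. */
    static List<Window> whileTrue(List<Sample> stream) {
        List<Window> windows = new ArrayList<>();
        Long start = null;
        long end = 0;
        double max = Double.NEGATIVE_INFINITY;
        for (Sample s : stream) {
            if (s.condition()) {
                if (start == null) {
                    start = s.timestamp();
                    max = Double.NEGATIVE_INFINITY;
                }
                end = s.timestamp();
                max = Math.max(max, s.value());
            } else if (start != null) {
                windows.add(new Window(start, end, max));
                start = null;
            }
        }
        if (start != null) { // window still open at end of stream
            windows.add(new Window(start, end, max));
        }
        return windows;
    }
}
```

A single linear scan with one bit of carried state is all that is needed here,
which is why expressing the same thing via LAG and self-joins feels wasteful.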
Parts of this could be done in most SQL dialects (I'm no expert on the
standard, but in Postgres one could use LAG and partitions), but this is not
efficient, and not all of the above can be expressed that way.
So we are thinking about an extension (or a dialect) for "traces" or "time
series" with a slightly extended syntax that allows queries like those stated
above.
To give you an example of what such an extension could look like:
```
SELECT start(), end(), MAX(current) FROM s7://127.0.0.1/0/0 WHILE
cycle_in_progress = TRUE
SELECT timestamp, current AS start_current FROM s7://127.0.0.1/0/0 WHERE
cycle_in_progress = TRUE TRIGGER ON_BECOME_TRUE
```
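The "interpolation join" mentioned in the stream characteristics above could,
in its simplest (sample-and-hold) form, be sketched like this (record names
are invented for illustration):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Sample-and-hold join: for every timestamp of the "fast" signal group,
// pick the latest value of the "slow" group at or before that time.
// Linear interpolation or splines would replace the floorEntry lookup.
public class SampleAndHoldJoin {
    record Point(long timestamp, double value) {}
    record Joined(long timestamp, double fast, double slow) {}

    static List<Joined> join(List<Point> fast, List<Point> slow) {
        TreeMap<Long, Double> slowByTime = new TreeMap<>();
        for (Point p : slow) {
            slowByTime.put(p.timestamp(), p.value());
        }
        List<Joined> result = new ArrayList<>();
        for (Point p : fast) {
            var held = slowByTime.floorEntry(p.timestamp());
            if (held != null) { // skip fast samples before the first slow one
                result.add(new Joined(p.timestamp(), p.value(), held.getValue()));
            }
        }
        return result;
    }
}
```

Even this trivial variant does not map onto a standard equi-join, which is one
of the reasons we believe new relational operators (custom RelNodes) would be
needed rather than plain rewrites to existing SQL.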
Why am I bothering you with this?
Well, first, you are experts, and I would love to get some feedback on these
thoughts.
But, most importantly, I am thinking about writing (yet another) SQL parser
with slight extensions and would then have to maintain a "runtime" which would
be partially similar (in functionality, not in maturity or sophistication) to
Calcite's Enumerable trait. So I was wondering whether there is a way to make
all of this work "on top" of Calcite (custom RelNodes and an extension to the
parser), but I'm unsure about that, as some of Calcite's internals are tied
very specifically to SQL, e.g., SqlToRelConverter.
Do you have any ideas on how one could implement this in a "minimally
invasive" way on top of Calcite, and whether this is possible "ex situ" or
should rather be done in the same codebase (e.g., as a subproject), as it
would need some changes near Calcite's core?
Please excuse this rather long email, but I would really appreciate any
answers, comments, or suggestions.
Best
Julian