Hi Mark,

Thanks for your reply.
In fact, I'm sorry that I forgot to mention MATCH_RECOGNIZE in my original mail.
I was really excited when I first heard about MATCH_RECOGNIZE, as it is
incredibly powerful and could be used to solve many of the problems I state in
my mail.
The only "drawback" I see is that it feels very technical and complex.
By that I mean that it took me quite a while to figure out how to use it (and I
would consider myself an experienced SQL user). And it kind of "breaks" the
foundation of SQL in the sense that it is pretty imperative rather than
declarative.
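
To illustrate what I mean: even something as simple as "find the timestamps
where a value first rises above a threshold" needs a fair amount of ceremony.
A sketch (assuming a hypothetical table readings(ts, val); untested):

```
SELECT crossing_ts
FROM readings
MATCH_RECOGNIZE (
  ORDER BY ts
  MEASURES ABOVE.ts AS crossing_ts
  ONE ROW PER MATCH
  AFTER MATCH SKIP PAST LAST ROW
  -- a row below the threshold, immediately followed by one above it
  PATTERN (BELOW ABOVE)
  DEFINE
    BELOW AS BELOW.val <= 42,
    ABOVE AS ABOVE.val > 42
)
```

The pattern/define split is very expressive, but it reads more like a state
machine specification than like a declarative query.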

This is not general criticism of the feature. The point I'm trying to make is
that there is (from my perspective) a large class of similar problems, and I
would love to have a solution that "feels" natural and offers suitable
"semantics" for the field.

But coming back to the MATCH_RECOGNIZE support in Calcite: is there any
progress with regard to Julian's post from July?
If not, I can offer to help with the implementation of the FSM / NFA.
One solution for us could then also be to take a query in the "Timeseries
SQL" dialect and transform it into a query with MATCH_RECOGNIZE.
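
For example, the TRIGGER ON_BECOME_TRUE query from my original mail below
could, as far as I can tell, be rewritten to roughly the following. A sketch
(assuming the s7:// source is registered as a table, here called plc_stream;
untested):

```
SELECT "timestamp", start_current
FROM plc_stream
MATCH_RECOGNIZE (
  ORDER BY "timestamp"
  MEASURES
    UP."timestamp" AS "timestamp",
    UP."current"   AS start_current
  ONE ROW PER MATCH
  AFTER MATCH SKIP PAST LAST ROW
  -- fire once per false-to-true transition of cycle_in_progress
  PATTERN (DOWN UP)
  DEFINE
    DOWN AS DOWN.cycle_in_progress = FALSE,
    UP   AS UP.cycle_in_progress = TRUE
)
```

So the dialect would mostly be syntactic sugar over such patterns.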

So if help is still needed, please let me know. A quick search through the
JIRA turned up CALCITE-1935, which suggests that some of the implementation is
still missing.

Best
Julian


On 22.10.18, 02:41, "Mark Hammond" <[email protected]> wrote:

    Hi Julian Feinauer,
    
    Do share your thoughts on MATCH_RECOGNIZE operator suitability, 
http://mail-archives.apache.org/mod_mbox/calcite-dev/201807.mbox/%[email protected]%3e
    
    Cheers,
    Mark.
    
    > On 22 Oct 2018, at 02:24, Julian Feinauer <[email protected]> 
wrote:
    > 
    > Dear calcite devs,
    > 
    > I have followed the project for a long time and love how Calcite made it
    > possible to use SQL everywhere (I have built several SQL interfaces on
    > top of specific file formats myself). I also like the strong support for
    > streaming SQL.
    > 
    > The reason I'm writing this email is not only to give the project some
    > love, but because we are thinking about a SQL "extension" which I think
    > is not too specific and could serve others in different use cases as
    > well.
    > 
    > In detail, we are working with streams of data from devices (think of
    > Industry 4.0). We read data, e.g., from PLCs (using the (incubating)
    > Apache PLC4X project, where I contribute) and do analytics on it. The
    > analyses done there are pretty similar to those on traces from tests,
    > e.g., automotive test drives, or from related industries. What all these
    > streams have in common is:
    > * they are usually ordered by time
    > * elements of different groups of signals ("rows" from "tables") arrive
    >   ordered by time but not with equal timestamps, e.g., one group each
    >   second, other quantities much more frequently
    > * the "natural" join for these signal groups ("tables") is some kind of
    >   interpolation (sample and hold, linear interpolation, splines, ...)
    >   with respect to (event) time
    > * in some cases signal types are not known and can only be guessed based
    >   on the first value, e.g., on CAN there is no strict notion of "double"
    >   or "integer" channels; rather, there are integer base values plus a
    >   conversion formula (like a*x + b) plus possible lookup tables for
    >   "other" values (SNA, NULL, DISABLED, ...)
    > 
    > On the other hand, the analyses we would like to perform often involve
    > timestamps:
    > * get the timestamps where a condition becomes true
    >   * a boolean value toggled
    >   * a numeric value is above / below a threshold
    >   * the signal change rate is above / below a threshold
    >   * ...
    > * get the values of certain signals at the point in time when a
    >   condition becomes true (see above)
    > * get windows based on conditions
    >   * while a signal is true
    >   * while a value is above ...
    >   * ...
    > * do aggregations on signals in the mentioned windows
    > 
    > Parts of this can be done in most SQL dialects (I'm no expert on the
    > standard, but in Postgres one could use LAG and partitions), but that is
    > not efficient, and not all of the above can be done that way.
    > So we are thinking about an extension (or a dialect) for "traces" or
    > "time series" whose syntax is slightly extended to allow queries such as
    > those stated above.
    > 
    > To give you an example of what such an extension could look like:
    > 
    > ```
    > SELECT start(), end(), MAX(current) FROM s7://127.0.0.1/0/0 WHILE 
cycle_in_progress = TRUE
    > SELECT timestamp, current AS start_current FROM s7://127.0.0.1/0/0 WHERE 
cycle_in_progress = TRUE TRIGGER ON_BECOME_TRUE
    > ```
    > 
    > Why am I bothering you with this?
    > Well, first, you are experts and I would love to get some feedback on
    > these thoughts.
    > But, most importantly, I am thinking about writing (yet another) SQL
    > parser with slight extensions and would then have to provide a "runtime"
    > which would be partially similar (in functionality, not in maturity or
    > sophistication) to Calcite's Enumerable trait. So I was wondering whether
    > there is a way to make all of this work "on top" of Calcite (custom
    > RelNodes and an extension to the parser), but I'm unsure about that, as
    > some of the internals of Calcite are tied very specifically to SQL,
    > e.g., SqlToRelConverter.
    > Do you have any ideas on how one could implement this in a "minimally
    > invasive" way on top of Calcite, and whether this is possible "ex situ",
    > or whether it should be done in the same codebase (e.g., as a
    > subproject), as it would need some changes near Calcite's core?
    > 
    > Please excuse this rather long email, but I would really appreciate any
    > answers, comments or suggestions.
    > 
    > Best
    > Julian
    
