I’ve been thinking a bit more about this use case.
It’s tricky because signal processing and SQL have different models of
“event”. EE folks talk about “edge triggering” and “level triggering”, which
are really just two different data models for the same physical phenomenon.
Consider an example of a step function. Suppose a signal is 0 until time 3.25
when it becomes 1. SQL would probably represent this as an event
(timestamp=0, value=0)
and another event
(timestamp=3.25, value=1)
But to use MATCH_RECOGNIZE effectively, you would need a stream of events
{(0, 0), (1, 0), (2, 0), (3, 0), (4, 1), (5, 1), …}
And even then, the stream does not capture exactly when the transition occurs,
just that it happens sometime between 3 and 4.
We could provide a transform in SQL that converts the edge events
{(0, 0), (3.25, 1)}
into level events on clock ticks
{(0, 0), (1, 0), (2, 0), (3, 0), (4, 1), …}
We could also provide a transform from an event stream that has time gaps; e.g.
{(0, 0), (1, 0), (6, 1)}
becomes
{(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (5, 0), (6, 1), (7, 1)}
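The two transforms could look something like this in Python (a minimal
sketch, not a proposal for the SQL surface; the helper names and the
sample-and-hold semantics on integer clock ticks are my assumptions):

```python
def edges_to_levels(edges, until):
    """Convert edge events [(time, value), ...] into level events on
    integer clock ticks 0..until, using sample-and-hold: each tick
    reports the most recent value at or before it.  Hypothetical
    helper, not a Calcite API."""
    out = []
    i, current = 0, None
    for tick in range(until + 1):
        # Advance past every edge that has occurred by this tick.
        while i < len(edges) and edges[i][0] <= tick:
            current = edges[i][1]
            i += 1
        out.append((tick, current))
    return out

def fill_gaps(levels, until):
    """Fill time gaps in a level stream by repeating the last value;
    this is the same sample-and-hold logic as edges_to_levels."""
    return edges_to_levels(levels, until)

print(edges_to_levels([(0, 0), (3.25, 1)], 5))
# -> [(0, 0), (1, 0), (2, 0), (3, 0), (4, 1), (5, 1)]
print(fill_gaps([(0, 0), (1, 0), (6, 1)], 7))
# -> [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (5, 0), (6, 1), (7, 1)]
```

Note how the transition at 3.25 only becomes visible at tick 4, which is
exactly the loss of precision described above.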
These transforms produce virtual data streams in a form that is easier to
write SQL against. For example, joins are easier if you can guarantee that there is
a value for every clock tick. So are windowed aggregates. They also bring
streams, constant tables and time-varying tables under the same roof.
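To illustrate the join point: once two streams are guaranteed to have a
value on every clock tick, joining them is a plain merge on timestamp, with
no interpolation needed at join time (again a hypothetical sketch, not
Calcite code):

```python
def join_on_ticks(left, right):
    """Join two level streams that each have a value on every clock
    tick.  Because timestamps align exactly, the join is a simple
    lookup by timestamp rather than an interpolating join."""
    right_by_time = dict(right)
    return [(t, v, right_by_time[t]) for (t, v) in left if t in right_by_time]

print(join_on_ticks([(0, 0), (1, 0), (2, 1)], [(0, 5), (1, 6), (2, 7)]))
# -> [(0, 0, 5), (1, 0, 6), (2, 1, 7)]
```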
Do you think this approach is worthwhile? Are there other data models in signal
processing/event processing that we could bring into the relational model by
applying transforms?
Julian
> On Oct 25, 2018, at 1:58 PM, Julian Feinauer <[email protected]>
> wrote:
>
> .... I just noted that I forgot to comment on Flink's implementation, sorry.
>
> I went through the patch which implemented basic functionality in the
> master[1] and I think that we cannot learn much from their approach directly
> as they reduce it to a CEP Pattern which is then forwarded to CEP where most
> of the magic happens.
> Thus, what they implemented now to make this feature work is, from my
> impression, on the level of what's already implemented with the parsing and
> the LogicalMatch.
>
> Sorry for the two emails
> Julian
>
> [1]
> https://github.com/apache/flink/commit/3acd92b45c21e081f781affc8cb5700d972f9b0b
>
> Am 25.10.18, 22:46 schrieb "Julian Feinauer" <[email protected]>:
>
> Hi Julian,
>
> I filed a JIRA for my general suggestion about "Timeseries SQL"
> (CALCITE-2640).
> For the discussion in the other thread, I had a look into the present
> state of the code (from you and Zhiqiang He) for parsing and the logical node.
>
> I also thought about the necessary implementation for the EnumerableMatch.
> I'm pretty familiar with the regex-to-NFA/DFA part (from our
> implementations) and the define part.
> But what I'm pretty unfamiliar with is the order and partition part (and
> especially how it's implemented in Calcite).
> Do you see any possibility to transform the Matching Part into a Window
> Aggregation function, or do I make things overly easy with this thought?
>
> Wouldn’t this also make it easier with regard to the PREV, NEXT, FIRST,
> LAST window agg functions?
> I can try to help with the implementation of the "inner" parts but I don’t
> feel that I'm familiar enough with the codebase to make the whole thing work.
>
> Thus, if anybody of the seasoned Calcite devs could offer some help, I
> would be happy to discuss details of the implementation and support the
> implementation as well as possible.
>
> Best
> Julian
>
>
> Am 23.10.18, 07:57 schrieb "Julian Feinauer"
> <[email protected]>:
>
> Hi Julian,
>
> first of all, thanks for your reply and your thoughts.
> Thinking about your arguments, I fully agree with what you say and we
> should really consider using MATCH_RECOGNIZE first and see where it gets us.
>
> Regarding our second "problem", the different channel groups (with unequal
> time stamps), we also need a sound mapping to SQL. My first thought was
> to use the "Drill approach" and simply simulate a table which has all the
> columns somebody might want (as we do not know that upfront), returning NULL
> or NaN values when the channel is not present at evaluation time (and doing
> all the interpolation and such in the background). Or does anybody have a
> better idea?
>
> For your suggested approach, I agree and will try to write some of our
> analyses (in our Java DSL) with MATCH_RECOGNIZE to see how well it fits and
> then come back to the list.
>
> Thanks
> Julian
>
> Am 23.10.18, 05:55 schrieb "Julian Hyde" <[email protected]>:
>
> Julian,
>
> Thanks for posting this to Calcite. We appreciate the opportunity
> to mull over a language and prevent a misguided SQL-like language.
>
> I agree with both you and Mark: MATCH_RECOGNIZE seems to be very
> well suited to your problem domain. And MATCH_RECOGNIZE is non-trivial and
> difficult to learn.
>
> But in its favor, MATCH_RECOGNIZE is in the SQL standard and has
> reference implementations in systems like Oracle, so we can assume that it is
> well-specified. And, in my opinion, it is well designed - it delivers
> significant extra power to SQL that could not be done efficiently or at all
> without it, and is consistent with existing SQL semantics. Lastly, the
> streaming systems such as Flink and Beam are adopting it.
>
> When your proposed language has gone through the same process, I
> suspect that it would end up being very similar to MATCH_RECOGNIZE.
> MATCH_RECOGNIZE may seem “imperative” because it is creating a
> state-transition engine, but finite-state automata can be reasoned about and
> safely transformed, and are therefore to all intents and purposes
> “declarative”.
>
> The biggest reason not to use MATCH_RECOGNIZE is your audience.
> There’s no point creating the perfect language if the audience doesn’t like
> it and want to adopt it. So perhaps your best path is to design your own
> language, find some examples and code them up as use cases in that language,
> and iterate based on your users’ feedback.
>
> If I were you, I would also code each of those examples in SQL
> using MATCH_RECOGNIZE, and make sure that there is a sound mapping between
> those languages. And maybe your language could be implemented as a thin layer
> above MATCH_RECOGNIZE.
>
> This is the same advice I would give to everyone who is writing a
> database: I don’t care whether you use SQL, but make sure your language maps
> onto (extended) relational algebra. (And if you create a SQL-like language
> that breaks some of the concepts of SQL, such as automatically joining tables,
> please don’t tell people that your language is SQL.)
>
> I’m sorry to say that Calcite’s implementation of MATCH_RECOGNIZE
> has not moved forward much since my email. Maybe your effort is the kick
> necessary to get it going. I can assure you that I still believe that
> MATCH_RECOGNIZE, and the algebra that underlies it, is a solid foundation.
>
> Julian
>
>
>
>> On Oct 21, 2018, at 10:04 PM, Julian Feinauer <[email protected]>
>> wrote:
>>
>> Hi Mark,
>>
>> thanks for your reply.
>> In fact, I'm sorry that I forgot to mention MATCH_RECOGNIZE in my original
>> mail.
>> I was really excited when I first heard about MATCH_RECOGNIZE as it is
>> incredibly powerful and could be used to solve many of the problems I state
>> in my mail.
>> The only "drawback" I see is that it feels so technical and complex.
>> By that I mean that it took me quite a while to figure out how to use it
>> (and I would consider myself an experienced SQL user). And it kind of
>> "breaks" the foundation of SQL in the sense that it is pretty imperative and
>> not so declarative.
>>
>> This is no general criticism of the feature. The point I'm trying to make is
>> that there is (from my perspective) a large class of similar problems, and I
>> would love to have a solution which "feels" natural and offers suitable
>> "semantics" for the field.
>>
>> But coming back to the MATCH_RECOGNIZE support in Calcite, is there any
>> progress with regard to Julian's post from July?
>> If not I can offer to give some support with the implementation of the FSM /
>> NFA.
>> One solution for us could then also be to take a query in the "Timeseries
>> SQL" dialect and transform it into a query with MATCH_RECOGNIZE.
>>
>> So if there is still help needed, please let me know (a quick search through
>> JIRA turned up CALCITE-1935, which suggests that there is still some
>> implementation missing).
>>
>> Best
>> Julian
>>
>>
>> Am 22.10.18, 02:41 schrieb "Mark Hammond" <[email protected]>:
>>
>> Hi Julian Feinauer,
>>
>> Do share your thoughts on MATCH_RECOGNIZE operator suitability,
>> http://mail-archives.apache.org/mod_mbox/calcite-dev/201807.mbox/%[email protected]%3e
>>
>> Cheers,
>> Mark.
>>
>>> On 22 Oct 2018, at 02:24, Julian Feinauer <[email protected]>
>>> wrote:
>>>
>>> Dear calcite devs,
>>>
>>> I have followed the project for a long time and love how Calcite has made
>>> it possible to use SQL everywhere (I have built several SQL interfaces on
>>> top of specific file formats myself). I also like the strong support for
>>> streaming SQL.
>>>
>>> The reason I'm writing this email is not only to give the project some love
>>> but because we are thinking about a SQL "extension" which I think is not
>>> too specific and could serve others in different use cases as well.
>>>
>>> In detail, we are working with streams of data from devices (think of
>>> Industry 4.0). We read data, e.g., from PLCs (using the incubating Apache
>>> PLC4X project, where I contribute) and do analytics on them. The analyses
>>> done there are pretty similar to those done on traces from tests, e.g.,
>>> automotive test drives, or in related industries. What all these streams
>>> have in common is:
>>> * usually ordered by time
>>> * elements of different groups of signals ("rows" from "tables") arrive
>>> ordered by time but not with equal timestamps, e.g., time each second,
>>> other quantities much more frequent
>>> * "natural" join for these signal groups ("tables") is some kind of
>>> interpolation (sample and hold, linear interpolation, splines, ...) with
>>> respect to (event-)time
>>> * In some cases signal types are not known and can only be guessed based on
>>> the first value, e.g., on CAN there is no strict notion of "double" or
>>> "integer" channels; rather there are integer base values + a conversion
>>> formula (like a*x + b) + possible lookup tables for "other" values (SNA,
>>> NULL, DISABLED, ...)
>>>
>>> On the other hand, the analyses we would like to perform often concern
>>> timestamps:
>>> * get timestamps where a condition becomes true
>>> * boolean value toggled
>>> * numeric value is above / below threshold
>>> * signal change rate is above / below threshold
>>> * ...
>>> * get the values of certain signals at the point in time when a condition
>>> becomes true (see above)
>>> * get windows based on conditions
>>> * while signal is true
>>> * while value above ...
>>> * ...
>>> * Do aggregations on signals in the mentioned windows
>>>
>>> Parts of this could be done in most SQL dialects (I'm no expert on the
>>> standard, but in Postgres one could use LAG and partitions), but this is
>>> not efficient and not all of the above could be done that way.
>>> So we are thinking about an extension (or a dialect) for "traces" or "time
>>> series" which has a slightly extended syntax to allow queries such as those
>>> stated above.
>>>
>>> To give you an example of what such an extension could look like:
>>>
>>> ```
>>> SELECT start(), end(), MAX(current) FROM s7://127.0.0.1/0/0 WHILE
>>> cycle_in_progress = TRUE
>>> SELECT timestamp, current AS start_current FROM s7://127.0.0.1/0/0 WHERE
>>> cycle_in_progress = TRUE TRIGGER ON_BECOME_TRUE
>>> ```
>>>
>>> Why am I bothering you with this?
>>> Well, first, you are experts and I would love to get some feedback on
>>> these thoughts.
>>> But, most important, I am thinking about writing (yet another) SQL parser
>>> with slight extensions and would then have to care for a "runtime" which
>>> would be partially similar (in functionality, not in maturity or
>>> sophistication) to Calcite's Enumerable trait. So I was wondering whether
>>> there is a way to make all of this work "on top" of Calcite (custom
>>> RelNodes and an extension to the parser), but I'm unsure about that, as
>>> some of the internals of Calcite are tied very specifically to SQL, like,
>>> e.g., SqlToRelConverter.
>>> Do you have any ideas on how one would be able to implement this "minimally
>>> invasively" on top of Calcite, and whether this is possible "ex situ" or if
>>> this should be done in the same codebase (e.g. a subproject), as it would
>>> need some changes near Calcite's core?
>>>
>>> Please excuse this rather long email but I would really appreciate any
>>> answers, comments or suggestions.
>>>
>>> Best
>>> Julian
>>
>>
>
>