I’ve been thinking a bit more about this use case.
It’s tricky because signal processing and SQL have different models of
“event”. EE folks talk about “edge triggering” and “level triggering”, which
are really just two different data models for the same physical phenomenon.
Consider an example of a step function. Suppose a signal is 0 until time 3.25
when it becomes 1. SQL would probably represent this as an event
(timestamp=0, value=0)
and another event
(timestamp=3.25, value=1)
But to use MATCH_RECOGNIZE effectively, you would need a stream of events
{(0, 0), (1, 0), (2, 0), (3, 0), (4, 1), (5, 1), …}
And even then, the stream does not capture exactly when the transition occurs,
just that it happens sometime between 3 and 4.
We could provide a transform in SQL that converts the edge events
{(0, 0), (3.25, 1)}
into level events on clock ticks
{(0, 0), (1, 0), (2, 0), (3, 0), (4, 1), …}
We could also provide a transform from an event stream that has time gaps; e.g.
{(0, 0), (1, 0), (6, 1)}
becomes
{(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (5, 0), (6, 1), (7, 1)}
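The two transforms could look something like this in Python (a minimal
sketch, not a proposal for the SQL surface; the helper names and the
sample-and-hold semantics on integer clock ticks are my assumptions):

```python
def edges_to_levels(edges, until):
    """Convert edge events [(time, value), ...] into level events on
    integer clock ticks 0..until, using sample-and-hold: each tick
    reports the most recent value at or before it.  Hypothetical
    helper, not a Calcite API."""
    out = []
    i, current = 0, None
    for tick in range(until + 1):
        # Advance past every edge that has occurred by this tick.
        while i < len(edges) and edges[i][0] <= tick:
            current = edges[i][1]
            i += 1
        out.append((tick, current))
    return out

def fill_gaps(levels, until):
    """Fill time gaps in a level stream by repeating the last value;
    this is the same sample-and-hold logic as edges_to_levels."""
    return edges_to_levels(levels, until)

print(edges_to_levels([(0, 0), (3.25, 1)], 5))
# -> [(0, 0), (1, 0), (2, 0), (3, 0), (4, 1), (5, 1)]
print(fill_gaps([(0, 0), (1, 0), (6, 1)], 7))
# -> [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (5, 0), (6, 1), (7, 1)]
```

Note how the transition at 3.25 only becomes visible at tick 4, which is
exactly the loss of precision described above.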
These transforms produce virtual data streams in a form that is easier to
write SQL against. For example, joins are easier if you can guarantee that there is
a value for every clock tick. So are windowed aggregates. They also bring
streams, constant tables and time-varying tables under the same roof.
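To illustrate the join point: once two streams are guaranteed to have a
value on every clock tick, joining them is a plain merge on timestamp, with
no interpolation needed at join time (again a hypothetical sketch, not
Calcite code):

```python
def join_on_ticks(left, right):
    """Join two level streams that each have a value on every clock
    tick.  Because timestamps align exactly, the join is a simple
    lookup by timestamp rather than an interpolating join."""
    right_by_time = dict(right)
    return [(t, v, right_by_time[t]) for (t, v) in left if t in right_by_time]

print(join_on_ticks([(0, 0), (1, 0), (2, 1)], [(0, 5), (1, 6), (2, 7)]))
# -> [(0, 0, 5), (1, 0, 6), (2, 1, 7)]
```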
Do you think this approach is worthwhile? Are there other data models in signal
processing/event processing that we could bring into the relational model by
applying transforms?
Julian
> On Oct 25, 2018, at 1:58 PM, Julian Feinauer <[email protected]>
> wrote:
>
> .... I just noted that I forgot to comment on Flink's implementation, sorry.
>
> I went through the patch which implemented basic functionality in the
> master[1] and I think that we cannot learn much from their approach directly
> as they reduce it to a CEP Pattern which is then forwarded to CEP where most
> of the magic happens.
> Thus, what they implemented now to make this feature work is, from my
> impression, on the level of what's already implemented with the parsing and
> the LogicalMatch.
>
> Sorry for the two emails
> Julian
>
> [1]
> https://github.com/apache/flink/commit/3acd92b45c21e081f781affc8cb5700d972f9b0b
>
> Am 25.10.18, 22:46 schrieb "Julian Feinauer" <[email protected]>:
>
> Hi Julian,
>
> I filed a JIRA for my general suggestion about "Timeseries SQL"
> (CALCITE-2640).
> For the discussion in the other thread, I had a look into the present
> state of the code (from you and Zhiqiang He) for parsing and the logical node.
>
> I also thought about the necessary implementation for the EnumerableMatch.
> I'm pretty familiar with the regex-to-NFA/DFA part (from our
> implementations) and the define part.
> But what I'm pretty unfamiliar with is the order and partition part (and
> especially how it's implemented in Calcite).
> Do you see any possibility to transform the Matching Part into a Window
> Aggregation function, or do I make things overly easy with this thought?
>
> Wouldn’t this also make it easier with regard to the PREV, NEXT, FIRST,
> LAST window agg functions?
> I can try to help with the implementation of the "inner" parts but I don’t
> feel that I'm familiar enough with the codebase to make the whole thing work.
>
> Thus, if anybody of the seasoned Calcite devs could offer some help, I
> would be happy to discuss details of the implementation and support the
> implementation as well as possible.
>
> Best
> Julian
>
>
> Am 23.10.18, 07:57 schrieb "Julian Feinauer"
> <[email protected]>:
>
> Hi Julian,
>
> first of all, thanks for your reply and your thoughts.
> Thinking about your arguments, I fully agree with what you say and we
> should really consider using MATCH_RECOGNIZE first and see where it gets us.
>
> Regarding our second "problem", the different channel groups (with unequal
> time stamps), we also need a sound mapping to SQL. My first thought was
> to use the "Drill approach" and simply simulate a table which has all the
> columns somebody might want (as we do not know that upfront), returning NULL
> or NaN values when the channel is not present at evaluation time (and doing
> all the interpolation and such in the background). Or does anybody have a
> better idea?
>
> For your suggested approach, I agree and will try to write some of our
> analyses (in our Java DSL) with MATCH_RECOGNIZE to see how well it fits and
> then come back to the list.
>
> Thanks
> Julian
>
> Am 23.10.18, 05:55 schrieb "Julian Hyde" <[email protected]>:
>
> Julian,
>
> Thanks for posting this to Calcite. We appreciate the opportunity
> to mull over a language and prevent a misguided SQL-like language.
>
> I agree with both you and Mark: MATCH_RECOGNIZE seems to be very
> well suited to your problem domain. And MATCH_RECOGNIZE is non-trivial and
> difficult to learn.
>
> But in its favor, MATCH_RECOGNIZE is in the SQL standard and has
> reference implementations in systems like Oracle, so we can assume that it is
> well-specified. And, in my opinion, it is well designed - it delivers
> significant extra power to SQL that could not be done efficiently or at all
> without it, and is consistent with existing SQL semantics. Lastly, the
> streaming systems such as Flink and Beam are adopting it.
>
> When your proposed language has gone through the same process, I
> suspect that it would end up being very similar to MATCH_RECOGNIZE.
> MATCH_RECOGNIZE may seem “imperative” because it is creating a
> state-transition engine, but finite-state automata can be reasoned about and
> safely transformed, and are therefore to all intents and purposes
> “declarative”.
>
> The biggest reason not to use MATCH_RECOGNIZE is your audience.
> There’s no point creating the perfect language if the audience doesn’t like
> it and want to adopt it. So perhaps your best path is to design your own
> language, find some examples and code them up as use cases in that language,
> and iterate based on your users’ feedback.
>
> If I were you, I would also code each of those examples in SQL
> using MATCH_RECOGNIZE, and make sure that there is a sound mapping between
> those languages. And maybe your language could be implemented as a thin layer
> above MATCH_RECOGNIZE.
>
> This is the same advice I would give to everyone who is writing a
> database: I don’t care whether you use SQL, but make sure your language maps
> onto (extended) relational algebra. (And if you create a SQL-like language
> that breaks some of the concepts of SQL, such as automatically joining tables,
> please don’t tell people that your language is SQL.)
>
> I’m sorry to say that Calcite’s implementation of MATCH_RECOGNIZE
> has not moved forward much since my email. Maybe your effort is the kick
> necessary to get it going. I can assure you that I still believe that
> MATCH_RECOGNIZE, and the algebra that underlies it, is a solid foundation.
>
> Julian
>
>
>
>> On Oct 21, 2018, at 10:04 PM, Julian Feinauer <[email protected]>
>> wrote:
>>
>> Hi Mark,
>>
>> thanks for your reply.
>> In fact, I'm sorry that I forgot to mention MATCH_RECOGNIZE in my original
>> mail.
>> I was really excited when I first heard about MATCH_RECOGNIZE as it is
>> incredibly powerful and could be used to solve many of the problems I state
>> in my mail.
>> The only "drawback" I see is that it feels so technical and complex.
>> By that I mean that it took me quite a while to figure out how to use it
>> (and I would consider myself an experienced SQL user). And it kind of
>> "breaks" the foundation of SQL in the sense that it is pretty imperative and
>> not so declarative.
>>
>> This is no general criticism of the feature. The point I'm trying to make is
>> that there is (from my perspective) a large class of similar problems, and I
>> would love to have a solution which "feels" natural and offers suitable
>> "semantics" for the field.
>>
>> But coming back to the MATCH_RECOGNIZE support in Calcite, is there any
>> progress with regard to Julian's post from July?
>> If not I can offer to give some support with the implementation of the FSM /
>> NFA.
>> One solution for us could then also be to take a query in the "Timeseries
>> SQL" dialect and transform it into a query with MATCH_RECOGNIZE.
>>
>> So if there is still help needed, please let me know (a quick search through
>> JIRA turned up CALCITE-1935, which suggests that there is still some
>> implementation missing).
>>
>> Best
>> Julian
>>
>>
>> Am 22.10.18, 02:41 schrieb "Mark Hammond" <[email protected]>:
>>
>> Hi Julian Feinauer,
>>
>> Do share your thoughts on MATCH_RECOGNIZE operator suitability,
>> http://mail-archives.apache.org/mod_mbox/calcite-dev/201807.mbox/%[email protected]%3e
>>
>> Cheers,
>> Mark.
>>
>>> On 22 Oct 2018, at 02:24, Julian Feinauer <[email protected]>
>>> wrote:
>>>
>>> Dear calcite devs,
>>>
>>> I have followed the project for a long time and love how Calcite has made
>>> it possible to use SQL everywhere (I have built several SQL interfaces on
>>> top of specific file formats myself). I also like the strong support for
>>> streaming SQL.
>>>
>>> The reason I'm writing this email is not only to give the project some love
>>> but because we are thinking about a SQL "extension" which I think is not
>>> too specific and could serve others in different use cases as well.
>>>
>>> In detail, we are working with streams of data from devices (think of
>>> Industry 4.0). We read data, e.g., from PLCs (using the incubating Apache
>>> PLC4X project, where I contribute) and do analytics on them. The analyses
>>> done there are pretty similar to those done on traces from tests, e.g.,
>>> automotive test drives, or in related industries. What all these streams
>>> have in common is:
>>> * usually ordered by time
>>> * elements of different groups of signals ("rows" from "tables") arrive
>>> ordered by time but not with equal timestamps, e.g., time each second,
>>> other quantities much more frequent
>>> * "natural" join for these signal groups ("tables") is some kind of
>>> interpolation (sample and hold, linear interpolation, splines, ...) with
>>> respect to (event-)time
>>> * In some cases signal types are not known and can only be guessed based on
>>> the first value, e.g., on CAN there is no strict notion of "double" or
>>> "integer" channels; rather there are integer base values + a conversion
>>> formula (like a*x + b) + possible lookup tables for "other" values (SNA,
>>> NULL, DISABLED, ...)
>>>
>>> On the other hand, the analyses we would like to perform often concern
>>> timestamps:
>>> * get timestamps where a condition becomes true
>>> * boolean value toggled
>>> * numeric value is above / below threshold
>>> * signal change rate is above / below threshold
>>> * ...
>>> * get the values of certain signals at the point in time when a condition
>>> becomes true (see above)
>>> * get windows based on conditions
>>> * while signal is true
>>> * while value above ...
>>> * ...
>>> * Do aggregations on signals in the mentioned windows
>>>
>>> Parts of this could be done in most SQL dialects (I'm no expert on the
>>> standard, but in Postgres one could use LAG and partitions), but this is
>>> not efficient and not all of the above could be done that way.
>>> So we are thinking about an extension (or a dialect) for "traces" or "time
>>> series" which has a slightly extended syntax to allow queries such as those
>>> stated above.
>>>
>>> To give you an example of what such an extension could look like:
>>>
>>> ```
>>> SELECT start(), end(), MAX(current) FROM s7://127.0.0.1/0/0 WHILE
>>> cycle_in_progress = TRUE
>>> SELECT timestamp, current AS start_current FROM s7://127.0.0.1/0/0 WHERE
>>> cycle_in_progress = TRUE TRIGGER ON_BECOME_TRUE
>>> ```
>>>
>>> Why am I bothering you with this?
>>> Well, first, you are experts and I would love to get some feedback on
>>> these thoughts.
>>> But, most important, I am thinking about writing (yet another) SQL parser
>>> with slight extensions and would then have to care for a "runtime" which
>>> would be partially similar (in functionality, not in maturity or
>>> sophistication) to Calcite's Enumerable trait. So I was wondering whether
>>> there is a way to make all of this work "on top" of Calcite (custom
>>> RelNodes and an extension to the parser), but I'm unsure about that, as
>>> some of the internals of Calcite are tied very specifically to SQL, like,
>>> e.g., SqlToRelConverter.
>>> Do you have any ideas on how one would be able to implement this "minimally
>>> invasively" on top of Calcite, and whether this is possible "ex situ" or if
>>> this should be done in the same codebase (e.g. a subproject), as it would
>>> need some changes near Calcite's core?
>>>
>>> Please excuse this rather long email but I would really appreciate any
>>> answers, comments or suggestions.
>>>
>>> Best
>>> Julian
>>
>>
>
>