I think the difficulty with JulianF’s signal processing domain is that he needs there to be precisely one record at every clock tick (or more generally, at every point in an N-dimensional discrete space).
Consider stock trading. A stock trade is an event that happens in continuous time, say (9:58:02 ORCL 41), (10:01:55 ORCL 43). Our query wants to know the stock price at 10:00 (or at every 1-minute interval), so we have to convert the event-oriented data into an array: (9:59 ORCL 41), (10:00 ORCL 41), (10:01 ORCL 41), (10:02 ORCL 43).

JulianF's domain may be more naturally in the realm of array databases [1], but there are a lot of advantages to relational algebra and SQL, not least that we have a reasonable story for streaming data, so let's try to bridge the gap. Suppose we add a FILL operator that converts an event-based relation into a dense array:

  SELECT * FROM TABLE(FILL(Trades, 'ROWTIME', INTERVAL '1' MINUTE))

Now we can safely join with other data at the same granularity. Is this a step in the right direction?

Julian

[1] https://en.wikipedia.org/wiki/Array_DBMS

> On Dec 18, 2018, at 7:05 AM, Michael Mior <[email protected]> wrote:
>
> I would say a similar theory applies. Some things are different when you're
> dealing with streams, mainly joins and aggregations. Semantics are
> necessarily different whenever you have operations involving more than one
> row at a time from the input stream. When dealing with a relation, an
> aggregation is straightforward: you just consume the entire input and
> output the result of the aggregation. Since streams don't end, you need to
> decide how this is handled, which usually amounts to a choice of windowing
> algorithm. There are a few other things to think about. The presentation
> linked below from Julian Hyde has a nice overview:
>
> https://www.slideshare.net/julianhyde/streaming-sql-62376119
>
> --
> Michael Mior
> [email protected]
>
>
> On Tue, Dec 18, 2018 at 02:28, Julian Feinauer <[email protected]> wrote:
>
>> Hi Michael,
>>
>> yes, our workloads are usually in the context of streaming (but for replay
>> or so we also use batch).
>> But, if I understand it correctly, the same theory applies to both tables
>> ("relations") and streaming tables, right?
>> I hope to find time soon to write a PLC4X-Calcite source which creates
>> one or many streams based on readings from a PLC.
>>
>> Julian
>>
>> On 18.12.18, 03:19, "Michael Mior" <[email protected]> wrote:
>>
>> Perhaps you've thought of this already, but it sounds like streaming
>> relational algebra could be a good fit here.
>>
>> https://calcite.apache.org/docs/stream.html
>> --
>> Michael Mior
>> [email protected]
>>
>>
>> On Sun, Dec 16, 2018 at 18:39, Julian Feinauer <[email protected]> wrote:
>>
>>> Hi Calcite-devs,
>>>
>>> I just had a very interesting mail exchange with Julian (Hyde) on the
>>> incubator list [1]. It was about our project CRUNCH (which is mostly about
>>> time-series analysis and signal processing) and its relation to relational
>>> algebra, and I wanted to bring the discussion to this list to continue here.
>>> We already had some discussion about how time series would work in Calcite
>>> [2], and it's closely related to MATCH_RECOGNIZE.
>>>
>>> But I have a more general question in mind to ask the experts here on
>>> the list: can we see signal processing and analysis tasks as a proper
>>> application of relational algebra?
>>> Disclaimer: I'm a mathematician, so I know the formalities of (relational)
>>> algebra pretty well, but I'm lacking a lot of experience and knowledge in
>>> database theory. Most of my knowledge there comes from Calcite's source
>>> code and the book by Garcia-Molina and Ullman.
>>>
>>> So if we take, for example, a stream of signals from a sensor, then we can
>>> of course do filtering or smoothing on it, and this can be seen as a mapping
>>> between the input relation and the output relation. But as we usually need
>>> more than just one tuple at a time, we lose many of the advantages of
>>> relational theory.
>>> And then, if we analyze the signal, we can again model it as a mapping
>>> between relations, but the input relation is a "time series" and the
>>> output relation consists of "events", so these are in some way different
>>> dimensions. In this situation it becomes mostly obvious where the main
>>> differences between time series and relational algebra lie. Think of
>>> something simple: an event should be registered whenever the signal
>>> switches from FALSE to TRUE (so not for every TRUE). This could also be
>>> modelled with MATCH_RECOGNIZE pretty easily. But to me it feels
>>> "unnatural" because we cannot use any indices (we don't care about the
>>> ratio of TRUE to FALSE in the DB, except perhaps for some very rough
>>> outer bounds), and we are lacking the "right" information for the
>>> optimizer, such as estimates of the number of analysis results.
>>> It gets even more complicated when moving to continuous-valued signals
>>> (INT, DOUBLE, ...), e.g., temperature readings.
>>> If we want to count the number of times we have a temperature change of
>>> more than 5 degrees in under 4 hours, this should also be doable with
>>> MATCH_RECOGNIZE, but again, there is no index to help us and we have no
>>> information for the optimizer, so it feels very "black box" to the
>>> relational algebra.
>>>
>>> I'm not sure if you get my point, but for me, the elegance of relational
>>> algebra was always this optimization machinery, which starts from a
>>> declarative query and ends in an "optimal" physical plan. And I do not
>>> see how we can use much of it for the examples given above.
>>>
>>> Perhaps one solution would be to do the same as for spatial queries (or
>>> the JSON / JSONB support in Postgres [3]): add specialized indices. Then
>>> this would make it more "relational algebra"-esque, in the sense that
>>> there really is a possibility to apply transformations to a given query.
>>>
>>> What do you think? Am I overcomplicating things, or am I missing
>>> something?
>>>
>>> Julian
>>>
>>> [1] https://lists.apache.org/thread.html/1d5a5aae1d4f5f5a966438a2850860420b674f98b0db7353e7b476f2@%3Cgeneral.incubator.apache.org%3E
>>> [2] https://lists.apache.org/thread.html/250575a56165851ab55351b90a26eaa30e84d5bbe2b31203daaaefb9@%3Cdev.calcite.apache.org%3E
>>> [3] https://www.postgresql.org/docs/9.4/datatype-json.html
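For concreteness, here is a sketch of what the proposed FILL operator would compute, written in Python rather than SQL purely for illustration. The `fill` function and its forward-fill semantics are my assumptions about how such an operator would behave, not an existing Calcite API:

```python
from datetime import datetime, timedelta

def fill(events, start, end, step):
    """Forward-fill an event stream into a dense series: for every grid
    point in [start, end], emit the value of the latest event at or before
    that point (None until the first event arrives)."""
    events = sorted(events)          # (timestamp, value) pairs
    out, i, value = [], 0, None
    t = start
    while t <= end:
        # advance past every event that happened at or before this tick
        while i < len(events) and events[i][0] <= t:
            value = events[i][1]
            i += 1
        out.append((t, value))
        t += step
    return out

# The ORCL trades from the example above
trades = [(datetime(2018, 12, 18, 9, 58, 2), 41),
          (datetime(2018, 12, 18, 10, 1, 55), 43)]
dense = fill(trades,
             datetime(2018, 12, 18, 9, 59),
             datetime(2018, 12, 18, 10, 2),
             timedelta(minutes=1))
# (9:59, 41), (10:00, 41), (10:01, 41), (10:02, 43)
```

Each grid point carries the value of the latest event at or before it, which is exactly the dense array in the ORCL example, and which makes a join with other 1-minute-granularity data well-defined.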

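JulianF's FALSE-to-TRUE example from the quoted thread can be sketched the same way. The `rising_edges` function below is an illustrative stand-in for what a row-pattern query over the signal would emit; it is not actual MATCH_RECOGNIZE or Calcite code:

```python
def rising_edges(signal):
    """Emit one event (the sample index) whenever a boolean signal
    switches from FALSE to TRUE -- not for every TRUE sample."""
    events, prev = [], False
    for i, value in enumerate(signal):
        if value and not prev:
            events.append(i)
        prev = value
    return events

print(rising_edges([False, True, True, False, True]))  # [1, 4]
```

Note that the operator is inherently stateful (each output row depends on the previous input row), which is exactly why a plain per-tuple relational mapping, and the usual index-based access paths, do not help here.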