Relational algebra and signal processing

Julian Feinauer Sun, 16 Dec 2018 15:41:28 -0800

Hi Calcite-devs,

I just had a very interesting mail exchange with Julian (Hyde) on the incubator
list [1]. It was about our project CRUNCH (which is mostly about time series
analysis and signal processing) and its relation to relational algebra, and I
wanted to bring the discussion to this list to continue it here.
We already had some discussion about how time series would work in Calcite [2],
and it is closely related to MATCH_RECOGNIZE.

But I have a more general question in mind that I would like to ask the experts
here on the list.
I ask myself whether we can see signal processing and analysis tasks as a
proper application of relational algebra.
Disclaimer: I’m a mathematician, so I know the formal side of (relational)
algebra pretty well, but I’m lacking a lot of experience and knowledge in
database theory. (Most of my knowledge there comes from Calcite’s source code
and the book by Garcia-Molina and Ullman.)

So if we take, for example, a stream of signals from a sensor, then we can of
course do filtering or smoothing on it and this can be seen as a mapping
between the input relation and the output relation. But as we usually need more
than just one tuple at a time, we lose many of the advantages of relational
theory. And then, if we analyze the signal, we can again model it as a mapping
between relations, but the input relation is a “time series” and the output
relation consists of “events”, so these are in some way different dimensions.
In this situation it becomes most obvious where the main differences between
time series and relational algebra are. Think of something simple: an event
should be registered whenever the signal switches from FALSE to TRUE (so not
for every TRUE). This could also be modelled with MATCH_RECOGNIZE pretty
easily. But to me it feels “unnatural” because we cannot use any indices (we
don’t care about the ratio of TRUE and FALSE in the DB, except for probably
some very rough outer bounds). And we are lacking the “right” information for
the optimizer, like estimates of the number of analysis results.
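
To make the FALSE-to-TRUE example concrete, here is a minimal sketch in plain
Python. It is not Calcite or CRUNCH code; the function name rising_edges and
the sample data are made up for illustration, and the signal is assumed to be
a sequence of (timestamp, value) pairs ordered by time. It shows that the
result depends on pairs of neighbouring tuples rather than on single tuples:

```python
from typing import Iterable, Iterator, Tuple


def rising_edges(samples: Iterable[Tuple[int, bool]]) -> Iterator[int]:
    """Yield the timestamp of every FALSE -> TRUE transition.

    Hypothetical helper for illustration; assumes samples are ordered by time.
    """
    previous = None  # no previous sample yet, so the first sample never fires
    for timestamp, value in samples:
        # An event is registered only on the switch, not for every TRUE.
        if previous is False and value is True:
            yield timestamp
        previous = value


# Made-up sample signal: (timestamp, value)
signal = [(0, False), (1, True), (2, True), (3, False), (4, False), (5, True)]
events = list(rising_edges(signal))
print(events)  # [1, 5]
```

The only state carried along is the previous value, and the scan has to follow
the time order from start to end.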
It gets even more complicated when moving to continuous-valued signals (INT,
DOUBLE, …), e.g., temperature readings.
If we want to count the number of times the temperature changes by more than
5 degrees in under 4 hours, this should also be doable with MATCH_RECOGNIZE.
But again, there is no index to help us and we have no information for the
optimizer, so it feels very “black box” from the point of view of relational
algebra.
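
The temperature example can be sketched the same way. This is only one possible
reading of the question, and the details are my assumptions: readings are
(timestamp in seconds, temperature) pairs ordered by time, a "change" means the
current reading differs by more than 5 degrees from any reading less than
4 hours old, and a run of consecutive readings that satisfy this counts once.
The name count_fast_changes and the data are hypothetical.

```python
from collections import deque
from typing import Deque, Iterable, Tuple


def count_fast_changes(
    readings: Iterable[Tuple[int, float]],
    max_delta: float = 5.0,
    window_seconds: int = 4 * 3600,
) -> int:
    """Count episodes where the value moved by more than max_delta
    relative to some reading less than window_seconds old."""
    window: Deque[Tuple[int, float]] = deque()
    in_episode = False
    count = 0
    for timestamp, value in readings:
        # Keep only readings strictly younger than the window ("under 4 hours").
        while window and timestamp - window[0][0] >= window_seconds:
            window.popleft()
        exceeded = any(abs(value - old) > max_delta for _, old in window)
        # Count only the start of an episode, not every reading inside it.
        if exceeded and not in_episode:
            count += 1
        in_episode = exceeded
        window.append((timestamp, value))
    return count


# Made-up hourly readings: (seconds since start, degrees)
readings = [
    (0, 20.0), (3600, 21.0), (7200, 26.5), (10800, 27.0),
    (14400, 27.5), (18000, 27.0), (21600, 27.2), (25200, 20.0),
]
print(count_fast_changes(readings))  # 2
```

As in the boolean case, the answer depends on a sliding window over ordered
tuples, which is the part that does not map onto a conventional index.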

I’m not sure if you get my point, but for me, the elegance of relational
algebra was always the optimization: it starts from a declarative query and
ends in an “optimal” physical plan. And I do not see how we can use much of
this for the examples given above.

Perhaps one solution would be to do the same as for spatial queries (or the
JSON / JSONB support in PostgreSQL [3]) and add specialized indices, statistics
and optimizer rules. This would make it more “relational algebra”-esque
in the sense that there really is a possibility to apply transformations to a
given query.

What do you think? Do I see things as too complicated, or am I missing something?

Julian

[1]
https://lists.apache.org/thread.html/1d5a5aae1d4f5f5a966438a2850860420b674f98b0db7353e7b476f2@%3Cgeneral.incubator.apache.org%3E
[2]
https://lists.apache.org/thread.html/250575a56165851ab55351b90a26eaa30e84d5bbe2b31203daaaefb9@%3Cdev.calcite.apache.org%3E
[3] https://www.postgresql.org/docs/9.4/datatype-json.html
