Hi Julian,

Thanks for your answer and your insights.
I agree with you on many points (especially our last discussion on the Calcite 
ML made me think a lot).
So I agree with your "layered" approach, and in fact this is what we currently 
do (without stating it explicitly enough, I think).

Basically, we do two things, I guess. First, we provide a (Java) DSL to make
it easy to write specific operations (and do some very limited optimization,
not at all comparable to what Calcite does).
Second, we also provide some functions which are useful or necessary for signal
processing (smoothing, filtering, ...), and we plan to extend them soon with
things like short- or long-term predictions, anomaly detection, and so on.
By providing suitable wrappers for all that stuff we are able to translate this 
to "real" streaming engines (currently Flink and Akka Streams) and run it there.
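
To give a rough idea of what such a function looks like, here is a minimal
sketch of a smoothing step (a trailing moving average) in plain Java. It is
only an illustration: the class and method names are made up for this mail and
are not CRUNCH's actual API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical example; not part of CRUNCH. */
final class MovingAverage {

    /**
     * Trailing moving average: each output value is the mean of the current
     * sample and up to (window - 1) samples before it.
     */
    static List<Double> smooth(List<Double> samples, int window) {
        if (window < 1) {
            throw new IllegalArgumentException("window must be >= 1");
        }
        List<Double> out = new ArrayList<>(samples.size());
        double sum = 0.0;
        for (int i = 0; i < samples.size(); i++) {
            sum += samples.get(i);
            if (i >= window) {
                // Drop the sample that just left the window.
                sum -= samples.get(i - window);
            }
            int n = Math.min(i + 1, window); // fewer samples at the start
            out.add(sum / n);
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [1.0, 1.5, 2.5, 3.5]
        System.out.println(smooth(List.of(1.0, 2.0, 3.0, 4.0), 2));
    }
}
```

In a layered setup, a function like this would sit behind the DSL and be
wrapped for whichever engine ends up running it.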

And indeed MATCH_RECOGNIZE could be a good implementation for many situations 
(definitely not all) and I hope that I can contribute soon to your recent work 
(I will continue the discussion on the Calcite list). But overall I'm really 
unsure whether our problem can be seen as a problem of relational algebra. I know
and like the overall framework very much (I would even say it's one of the most
elegant applications of math I've seen so far). But it feels like it doesn't fit
that well. As soon as you have a problem where relations are related, even for
simple things like LAG or LEAD as window functions, it gets pretty complicated
and unnatural with regard to the definition of the algebra. But as I'm lacking a
lot of expertise there, I would love to discuss the matter further with you
(but again, I think we should do it on the Calcite list).
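
As a small illustration of the LAG point: a question like "how long did the
machine spend in this state" only needs each sample together with its
predecessor in time order. The sketch below does that in plain Java over an
ordered list. The Sample class and the durations method are hypothetical names
for this mail, not CRUNCH or Calcite code, and the sketch assumes the samples
are already sorted by timestamp.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical example; not part of CRUNCH or Calcite. */
final class TimeInState {

    /** One timestamped observation of the machine state. */
    static final class Sample {
        final long timestampMillis;
        final String state;

        Sample(long timestampMillis, String state) {
            this.timestampMillis = timestampMillis;
            this.state = state;
        }
    }

    /**
     * Sums, per state, the time between each sample and its successor.
     * Assumes the list is ordered by timestamp. The last sample has no
     * successor, so it contributes no duration.
     */
    static Map<String, Long> durations(List<Sample> ordered) {
        Map<String, Long> total = new LinkedHashMap<>();
        for (int i = 1; i < ordered.size(); i++) {
            Sample prev = ordered.get(i - 1); // what LAG(..., 1) would give us
            Sample cur = ordered.get(i);
            total.merge(prev.state, cur.timestampMillis - prev.timestampMillis, Long::sum);
        }
        return total;
    }

    public static void main(String[] args) {
        List<Sample> samples = List.of(
                new Sample(0L, "IDLE"),
                new Sample(1000L, "RUNNING"),
                new Sample(4000L, "RUNNING"),
                new Sample(6000L, "IDLE"));
        // prints {IDLE=1000, RUNNING=5000}
        System.out.println(durations(samples));
    }
}
```

The whole computation hinges on the ordering of the samples, which is exactly
the part that a set-based formulation has to express through a window.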

The following small ASCII image depicts how I think of these "layers". From
our perspective, MATCH_RECOGNIZE is one way to solve the problem, and we can
also provide "native" blocks to run directly on a streaming engine. There are
surely pros and cons for both sides:

                O CRUNCH Evaluation
                |
        -----------------
        |               |
     STREAM      Rel. Expression with MATCH_RECOGNIZE
        |               |
 Streaming Engines      |
                        |
                SQL-based Engines

So, I'm not exactly sure from your mail which approach you would prefer, but my
suggestion for the next steps with CRUNCH would be to enrich the DSL, add more
domain-specific functions, find more use cases and get more users on board; in
other words, to work on the semantics side of things. But in parallel we should
work towards a better separation of "business logic" and execution, with
support for multiple frameworks and especially the relational algebra side.
Perhaps we can conclude at some point that we can cover everything with Calcite
(I'm skeptical right now), but I think what's needed for this discussion is a
solid basis to show you Calcite devs in depth what exactly we are doing.

Julian


Am 16.12.18, 08:20 schrieb "Julian Hyde" <jh...@apache.org>:

    Hi Julian,
    
    Regarding whether to do this as a streaming engine (with its own query
    language) or as a framework above a streaming engine, I’d say that’s a
    false choice. If there is relational algebra inside your system, you can
    provide a high-level query language that can be translated to a
    lower-level query language in a streaming engine.
    
    This approach of “layered” databases has worked well for me for several
    projects, and is ever more applicable these days as data is becoming
    federated.
    
    You and I have discussed SQL’s MATCH_RECOGNIZE clause as a way to build
    complex time-based logic. You have probably noticed that it is now in
    Flink, I am working on it in Calcite, and Beam will probably get it at
    some point. Even if MATCH_RECOGNIZE doesn’t solve your problem, let’s
    follow the same approach - convert your problem to a DSL that maps to or
    extends relational algebra, and then figure out how to translate that to
    SQL in an underlying engine. Calcite is a very good platform for building
    new “data languages”, so let’s carry on talking.
    
    Julian
    
    
    > On Dec 14, 2018, at 2:11 AM, Julian Feinauer <j.feina...@pragmaticminds.de> wrote:
    > 
    > Hi all,
    > 
    > I just joined the incubator ML and wanted to introduce myself and possibly
    > also start a discussion about a software project we developed in the past.
    > But first things first. My name is Julian Feinauer and I come from Germany,
    > where I run two “start-up” companies where we work a lot on “industrial
    > IoT” topics, data science and processing of “larger amounts of data”. We
    > love open source and so we love the ASF. Most notably, I closely follow the
    > Apache Calcite project and hopefully will find some time soon to contribute
    > a bit more than in the last months. Furthermore, I am engaged in the
    > (incubating) PLC4X project as (P)PMC and in the (incubating) Edgent project,
    > where I try to “revive” the community as new (P)PMC together with
    > Christopher Dutz.
    > 
    > Now to the real topic. Over the last 3 years I have been developing a
    > “Framework/Library” (currently a set of jars) to facilitate processing of
    > time series data. The focus is mostly on processing of data from test
    > stands, e.g., automotive tests, driving profiles and so on. Furthermore, in
    > the past year we added a lot of functionality for processing of “industrial
    > data”. This means that we want to make it easy to analyze things like “how
    > long did the machine spend in this state”, “when is the following set of
    > bits set” or “notify when the following condition is true for the first
    > time”.
    > It is a bit technical and I don’t want to go too deep into it, but generally
    > speaking we try to introduce the “right” semantics to answer the typical
    > questions when analyzing machine or test data. This project is called
    > “CRUNCH” and we are in the process of making it open source (it will be
    > moved to a public GitHub repo this year) under the Apache 2.0 License.
    > 
    > As there is a close relationship to other (incubating or TLP) projects, we
    > are wondering whether this project could fit into the incubator. Some
    > examples of Apache projects that we see as “related” are Apache Flink (which
    > we can use as the streaming engine to process the stream) and (incubating)
    > Edgent, which we can also support as a streaming engine and where we are
    > currently trying to find a suitable project goal and community, as some of
    > the (P)PMC members retired or went inactive. Finally, CRUNCH has a very
    > natural fit with PLC4X because it can directly process the data gathered
    > from PLCs (and in fact we are already using it in some of our projects that
    > way). I had several discussions with some of the (P)PMCs of PLC4X, namely
    > Sebastian Rühl and Christopher Dutz, who encouraged me to introduce the
    > project to the incubator because they also see some potential for the
    > project to enrich the OSS ecosystem with regard to edge / stream processing
    > of (I)IoT data.
    > 
    > So please feel free to ask questions or discuss your view on this topic, as
    > I would like to find out whether this project could fit in the Apache
    > ecosystem and the Incubator or not.
    > 
    > Thank you already!
    > Julian
    
    