>>> First of all, thank you very much for your explanations and clarifications! It is very interesting and useful!
>>>
>>> Let me ask a few more questions and provide a few comments.
>>>
>>> > Hi All, these questions and answers are very educational. Shall we add them to our doc FAQs?
>>>
>>> I think it would be a very good idea to add something like this to the FAQs, or to create some sort of an "architecture and implementation overview" document.
>>>
>>> 1) How many rules/queries can be defined in one engine? How does it affect performance?
>>>
>>> For example, can I define (tens of) thousands of queries using the same (or multiple) instances of SiddhiManager? Would it make processing much slower? Or is the speed not proportional to the number of queries? E.g., when a new event arrives, does Siddhi test it in a linear fashion against each query, or does Siddhi keep an internal state machine that tries to match an event against all rules at once?
>>>
>>> > SiddhiManager can have many queries, and if you chain the queries in a linear fashion then all those queries will be executed one after the other and you might see some performance degradation, but if you have them in parallel then there won't be any issues.
>>>
>>> Well, before I got this answer, I created a few test cases to check experimentally how it behaves. I created a single instance of SiddhiManager and added 10000 queries that all read from the same input stream, check whether a specific attribute (namely, price) of an event is inside a given random interval (price >= random_low and price <= random_high), and output randomly into one of 100 streams. Then I measured the time required to process 1000000 events using this setup. I also did exactly the same experiment with Esper.
>>>
>>> My findings were that Siddhi is much slower than Esper in this setup.
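The experiment just described can be approximated with a small, engine-independent sketch (hypothetical code using neither the Siddhi nor the Esper API; the query and event counts are scaled down from the numbers above so it runs instantly):

```java
import java.util.*;

// Engine-independent sketch of the benchmark above: N queries, each
// checking an event's price against a random [low, high] interval,
// evaluated linearly per event (which is the O(N)-per-event behavior
// described in the mail).
public class IntervalBenchmark {

    /** Feeds each event price to every interval query; returns the total match count. */
    public static long run(double[][] intervals, double[] prices) {
        long matches = 0;
        for (double price : prices) {
            for (double[] iv : intervals) {          // linear scan: O(queries) per event
                if (price >= iv[0] && price <= iv[1]) {
                    matches++;
                }
            }
        }
        return matches;
    }

    /** Builds n random intervals over [0, 100), mimicking the random_low/random_high setup. */
    public static double[][] randomIntervals(int n, Random rnd) {
        double[][] ivs = new double[n][2];
        for (int i = 0; i < n; i++) {
            double a = rnd.nextDouble() * 100, b = rnd.nextDouble() * 100;
            ivs[i][0] = Math.min(a, b);
            ivs[i][1] = Math.max(a, b);
        }
        return ivs;
    }
}
```

Wrapping `run` in a timing loop (e.g. `System.nanoTime()` before and after) reproduces the shape of the measurement, independently of either engine.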
>>> After looking into the internal implementations of both, I realized the reason. Siddhi processes all queries that read from the same input stream in a linear fashion, sequentially. Even if many of the queries have almost the same condition, Siddhi makes no optimization attempts. Esper detects that many queries have a condition on the same variable and creates some sort of a decision tree. As a result, its running time is O(log N), whereas Siddhi needs O(N).
>>>
>>> I'm not saying that this test case is very typical or important, but maybe Siddhi should analyze the complete set of queries and apply some optimizations when possible, i.e. a bit of global optimization. It could detect common sub-expressions or sub-conditions in the queries and evaluate them only once, instead of evaluating them over and over again in each query separately.
>>>
>>> After getting these first results, I changed the setup so that each query uses one of many input streams (e.g. one of 300) instead of the same one. This greatly improved the situation, because the number of queries per input stream was much smaller and thus processing was much faster. But even in this setup it is still about 5-6 times slower than Esper.
>>
>> Could you share your test cases? We can have a look. We have not worked much with thousands of queries,
>
> Yes, I can provide my test cases - the source code is actually pretty small. What is the best way to do it? Should I simply attach a ZIP file with my project, or create a small GitHub project?
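The common sub-condition optimization suggested above can be illustrated with a standalone sketch (hypothetical code, not Siddhi or Esper internals): queries with an identical interval are grouped up front, so the shared condition is evaluated once per group instead of once per query.

```java
import java.util.*;

// Hypothetical sketch (not Siddhi or Esper code) contrasting linear
// per-query evaluation with sharing identical conditions among queries.
public class SharedConditionDemo {

    /** Linear evaluation: test every query's interval against the event. */
    public static List<String> matchLinear(Map<String, double[]> queries, double price) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, double[]> q : queries.entrySet()) {
            double[] iv = q.getValue();
            if (price >= iv[0] && price <= iv[1]) {
                hits.add(q.getKey());
            }
        }
        return hits;
    }

    /** Shared evaluation: group queries by identical interval, then
     *  evaluate each distinct interval only once per event. */
    public static List<String> matchShared(Map<String, double[]> queries, double price) {
        // In a real engine this grouping would be built once at deployment
        // time, not per event; it is inlined here to keep the sketch short.
        Map<String, List<String>> groups = new HashMap<>();
        for (Map.Entry<String, double[]> q : queries.entrySet()) {
            String key = q.getValue()[0] + ":" + q.getValue()[1];
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(q.getKey());
        }
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, List<String>> g : groups.entrySet()) {
            String[] bounds = g.getKey().split(":");
            double low = Double.parseDouble(bounds[0]);
            double high = Double.parseDouble(bounds[1]);
            if (price >= low && price <= high) {
                hits.addAll(g.getValue()); // condition checked once for the whole group
            }
        }
        return hits;
    }
}
```

With 10000 queries but only a few distinct intervals, the shared variant does a handful of comparisons per event where the linear one does 10000; an interval/decision tree over the distinct bounds would further reduce it toward O(log N).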
Could you report a JIRA here https://wso2.org/jira/browse/CEP and attach it?

>> but likely it is something we can fix without much trouble.
>
> Sounds promising.

>>> 2) Is it possible to easily disable/enable some queries?
>>>
>>> In my use cases I have a lot of queries. Actually, I have a lot of tenants, and each tenant may have something like 10-100 queries. Rather often (e.g. a few times a day), tenants would like to disable/enable some of their queries. What is the proper way to do it? Is it a costly operation, i.e. does Siddhi need to perform a lot of processing to disable or enable a query? Is it better to keep a dedicated SiddhiManager instance per tenant, or is it OK to have one SiddhiManager instance which handles all those tenants with all their queries?
>>>
>>> > The general norm is that you use a SiddhiManager per scenario, where each scenario might contain one or more queries. With this model it is easy for a tenant to add or remove a scenario, and it will not affect other queries and tenants.
>>>
>>> If I have tens of thousands of tenants, then having a dedicated SiddhiManager per tenant is probably not very practical or even possible, as it will get pretty heavyweight, I guess.
>>>
>>> Therefore, having the ability to enable/disable a query could be very practical. In fact, it could probably be implemented very easily. Imagine that each query object has a boolean flag that indicates whether it is enabled. If the condition matches, then before Siddhi performs the insert (i.e. the action), it could check whether the query is disabled. If it is disabled, no action (i.e. insert) is performed at all. Of course, there is still some overhead when matching the query. But maybe even this can be skipped if the query is disabled, i.e. its conditions immediately evaluate to "false" and thus never trigger?
>>>
>>> BTW, Esper has this feature.
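The enable/disable flag sketched above could look roughly like this (a minimal hypothetical sketch, not the Siddhi or Esper API; `ToggleableQuery` and its methods are invented names):

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.DoublePredicate;

// Hypothetical sketch of the boolean-flag idea from the mail: a disabled
// query short-circuits before its condition is evaluated, so toggling a
// query is cheap and a disabled query adds almost no per-event overhead.
public class ToggleableQuery {
    private final DoublePredicate condition;
    private final AtomicBoolean enabled = new AtomicBoolean(true);

    public ToggleableQuery(DoublePredicate condition) {
        this.condition = condition;
    }

    /** Cheap toggle: no query removal or redeployment needed. */
    public void setEnabled(boolean on) {
        enabled.set(on);
    }

    /** A disabled query never matches and never runs its condition. */
    public boolean process(double price) {
        if (!enabled.get()) {
            return false; // skip condition evaluation entirely
        }
        return condition.test(price);
    }
}
```

The `AtomicBoolean` makes the toggle safe even if events are processed on a different thread from the one flipping the flag.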
>>> You can disable/enable any query without removing and later adding it again.
>>
>> My understanding is that the Siddhi manager is not heavy, but I will let Suho answer.
>>
>>> When it comes to Siddhi persistent stores, you write:
>>> > It only stores the state information of the processing, e.g. the current running avg of the average calculation. This will be used when the server recovers from a failure.
>>>
>>> OK, I understand what it does now. BTW, does it also store any sliding windows as well, so that failover may happen?
>>
>> Yes, it stores everything, so failover works.
>>
>>> My further question is: how to support more dynamic scenarios, where the set of queries is not totally static? What if the set of rules changes a few times per hour/day/etc.? Maybe it would also make sense to persist the set of queries that were deployed on a given SiddhiManager? This way a user doesn't need to do any custom bookkeeping for the set of queries.
>>>
>>> Yet another question about Siddhi: is it possible to express queries that work with absolute time or timers, without providing a time inside events? E.g., how can one express in a query something like "time is between 9:30 AM and 10:00 AM"? Is it possible to work with timers in a query? Basically, I'd like to trigger certain actions at a specific time or on a regular basis (every N minutes), and I'm wondering how this can be expressed using Siddhi's query language.
>>
>> One trick I have used is to create a timer stream, send events to that timer stream periodically, and write the query using that timer stream to do what I need. We wanted to add timers as an inbuilt concept, so that you could just say "from Timer(10s)" to receive events every 10 secs, but I think it is not yet added.
>
> Ah, so it is a planned feature? Cool!

Yes.

>>> And my last question for now: is it possible to have nested structures in events, e.g.
>>> something like this: "select field1.field12[3].field1234 from ..."? It means that an event has a field called field1, which in turn has an array sub-field called field12, and each element of this array has a field called field1234. Is this possible? Or does Siddhi assume a flat structure of events, i.e. each event can have only fields of basic types?
>>
>> No, we do not do nested structures within Siddhi; it assumes flat events. E.g., we take XML and match it to a flat structure.
>
> OK, I understand. I'd say it covers 90% of all use cases. But having support for nested structures (a la Esper) could be interesting. And I think the implementation would be pretty straightforward.

We have input and output adaptors that let us map a tree (e.g. XML) to a flat structure. Yes, there are still some scenarios it does not cover.

> BTW, a few questions somewhat related to this one:
>
> - What if I need to handle events which have a few mandatory fields while all other fields are optional? "define stream" only allows a fixed structure, AFAIK, especially because it is assumed to be mapped to an object array. But it could be interesting to allow mapping events to key/value maps. With this representation it is pretty easy to support events/streams with any number of fields. The mandatory ones can be described in "define stream" and the others are accessed at run time by means of a key lookup. The syntax could be:
>
> define stream MyType map (field1 string, field2 int, field3 float)

Siddhi does not support optional fields; we did this for performance, actually.

> - In some of my test cases, I wanted to avoid using a single stream for all tenants (because it is very slow - see my previous messages). So I created one stream per tenant (e.g. 300000). All such streams are structurally the same, but have different names.
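The key/value-map event representation suggested above could be sketched like this (hypothetical code; this is not existing Siddhi syntax or API, and `MapEvent` and its field names are invented for illustration):

```java
import java.util.*;

// Hypothetical sketch of the "map events" idea: mandatory fields are
// declared with expected types and validated, while any number of extra
// optional fields live in the same key/value map and are looked up by
// name at run time.
public class MapEvent {
    // Declared mandatory fields: name -> expected type (like "define stream ... map (...)").
    private static final Map<String, Class<?>> MANDATORY = Map.of(
            "symbol", String.class,
            "price", Double.class);

    private final Map<String, Object> values = new HashMap<>();

    /** Sets any field, mandatory or optional; returns this for chaining. */
    public MapEvent set(String key, Object value) {
        values.put(key, value);
        return this;
    }

    /** Validates that every mandatory field is present with the right type. */
    public boolean isValid() {
        for (Map.Entry<String, Class<?>> f : MANDATORY.entrySet()) {
            Object v = values.get(f.getKey());
            if (v == null || !f.getValue().isInstance(v)) {
                return false;
            }
        }
        return true;
    }

    /** Optional fields are read by key, with a default when absent. */
    public Object getOrDefault(String key, Object def) {
        return values.getOrDefault(key, def);
    }
}
```

The trade-off the reply points at is visible here: every field access becomes a hash lookup plus boxing, which is why a fixed object-array layout is faster.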
> I noticed that it consumes quite some memory, because stream definitions are not shared, even though they are immutable as far as I understand. Maybe it would be a good idea to share stream definitions when they are the same? I.e., StreamDefinition gets two fields, "String name" and "StreamRepresentation streamRep", and the representation part is shared by all streams with the same structure. An even better idea could be to allow custom types. Then you could do something like:
>
> define type MyType (field1 string, field2 int, field3 float)
> define stream MyStream1 MyType
> define stream MyStream2 MyType
> define stream MyStream3 MyType
> define stream MyStream4 MyType
> ...
>
> Plus, if custom types could be defined, one could allow using them in stream/type definitions, e.g.:
>
> define type MySecondType (field1 string, field2 int, field3 float, field4 MyType)

Sharing stream representations is a good idea, and I think it is not too hard to do. Could you open a JIRA?

> Now, a different question: as far as I understand, it is currently possible to join only 2 streams at once. Is that a correct understanding? If so, I'd like to understand the reasons for this limitation. Is there a real technical problem that makes joining 3 or more streams difficult or impossible? Or is it a temporary limitation? Some of the rules used in my use cases require inputs from 4-6 streams. Modeling this with multi-level 2-way joins is really annoying. Having support for n-way joins would make my life much easier ;-)

We decided to keep it simple. Maybe we should add syntax for this that internally does a multi-level join. That needs some work, though, and it will take some time before we get to it.

> And BTW, the current syntax for sequences is a bit ... misleading, IMHO (though it is a minor issue).
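The definition-sharing idea discussed above is essentially the flyweight pattern, which could be sketched as follows (hypothetical code; Siddhi's actual StreamDefinition class differs, and `StreamDefs` is an invented name):

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical flyweight sketch: the immutable attribute layout is interned
// in a cache, so e.g. 300000 structurally identical per-tenant streams share
// one representation object instead of each holding its own copy.
public class StreamDefs {

    /** Immutable, shareable structural part of a stream definition. */
    public static final class Representation {
        public final List<String> attributes; // e.g. "price float"
        Representation(List<String> attributes) {
            this.attributes = List.copyOf(attributes);
        }
    }

    /** Per-stream part: only the name is unique; the layout is shared. */
    public static final class StreamDefinition {
        public final String name;
        public final Representation rep;
        StreamDefinition(String name, Representation rep) {
            this.name = name;
            this.rep = rep;
        }
    }

    private static final Map<List<String>, Representation> CACHE = new ConcurrentHashMap<>();

    /** Returns a definition whose representation object is shared by all
     *  streams declared with an equal attribute list. */
    public static StreamDefinition define(String name, List<String> attributes) {
        Representation rep = CACHE.computeIfAbsent(List.copyOf(attributes), Representation::new);
        return new StreamDefinition(name, rep);
    }
}
```

A "define type" feature as proposed in the mail would make this sharing explicit in the query language; the cache above achieves the same memory saving implicitly, keyed on structural equality.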
> When someone writes "from Stream1 as s1, Stream2 as s2, Stream3 as s3 ...", one usually expects that it means a join of all those streams, because this is how SQL works, and some other CEP engines as well (e.g. Esper). But Siddhi treats it as a sequence of events, which is a very different thing. Therefore I think this syntax is a bit dangerous for newcomers and for those familiar with SQL and/or other CEP engines.

When the difference is known, it is pretty intuitive and powerful. We thought it is much easier to think about it that way. But I see what you mean as well.

> One more thing I noticed while experimenting with Siddhi:
>
> - Siddhi JARs are available from Maven Central and the WSO2 Maven repos, which is very nice. But would it be possible to provide source JARs as well (not only for Siddhi, but also for all WSO2 projects)? Right now they are not available, and I had to check out the whole WSO2 repo to build the Siddhi binary and source JARs. And this repo checkout is >500 MB, so it takes a while ;-(
>
> Thanks,
> Leo

--
============================
Srinath Perera, Ph.D.
Director, Research, WSO2 Inc.
Visiting Faculty, University of Moratuwa
Member, Apache Software Foundation
Research Scientist, Lanka Software Foundation
Blog: http://srinathsview.blogspot.com/
Photos: http://www.flickr.com/photos/hemapani/
Phone: 0772360902
_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
