Re: [Architecture] A few questions about WSO2 CEP/Siddhi

Leo Romanoff Tue, 11 Mar 2014 03:47:20 -0700

Hi,

As you requested, I created the following issues:


https://wso2.org/jira/browse/CEP-709 - about sharing stream representations


https://wso2.org/jira/browse/CEP-710 - about performance problems due to linear 
iteration over rules


https://wso2.org/jira/browse/CEP-711 - provide source jars for Siddhi


> Siddhi does not support optional fields, we did this for performance 
> actually. 

I see your point. But does it really improve performance? In any case, I 
suggest supporting maps or optional fields only if a user asks for them. 
I.e. the current Object[]-based approach stays the default, and another 
representation is used only if a user explicitly asks for a map-based 
representation or optional fields.

I could even imagine a mixture of both representations: 
- Object[] is still used for sending events
- All mandatory fields (e.g. K fields) are the first K elements of this array. 
- All optional fields are put into a map which is passed as the last element of 
the array, i.e. it has index K.
- If there are no optional elements allowed, there is no element at index K
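
To make the mixed layout above concrete, here is a minimal Java sketch. It is 
only an illustration of the idea, not Siddhi code: the class HybridEvent and 
its methods of() and optional() are hypothetical names I made up, and the 
sample field values are arbitrary.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, NOT part of Siddhi: it only illustrates the mixed layout.
final class HybridEvent {
    private HybridEvent() {}

    // Builds the event array: the K mandatory values come first; if optional
    // fields are present, a single map holding them is appended at index K.
    static Object[] of(Object[] mandatory, Map<String, Object> optional) {
        int k = mandatory.length;
        if (optional == null) {
            return mandatory.clone(); // no optional fields: no element at index K
        }
        Object[] event = new Object[k + 1];
        System.arraycopy(mandatory, 0, event, 0, k);
        event[k] = new HashMap<String, Object>(optional);
        return event;
    }

    // Looks up an optional field by name; returns null if the event carries
    // no map at index K or the map has no such key.
    static Object optional(Object[] event, int k, String name) {
        if (event.length <= k || !(event[k] instanceof Map)) {
            return null;
        }
        return ((Map<?, ?>) event[k]).get(name);
    }

    public static void main(String[] args) {
        Object[] mandatory = {"IBM", 75.6f}; // K = 2 mandatory fields
        Map<String, Object> extras = new HashMap<String, Object>();
        extras.put("exchange", "NYSE");

        Object[] plain = of(mandatory, null);
        Object[] rich = of(mandatory, extras);
        System.out.println(plain.length + " " + optional(plain, 2, "exchange"));
        System.out.println(rich.length + " " + optional(rich, 2, "exchange"));
    }
}
```

Access to mandatory fields stays a plain array index, so streams without 
optional fields would pay nothing extra; only optional lookups go through 
the map.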

Best regards,
  -Leo


Srinath Perera <[email protected]> wrote at 10:34 on Tuesday, 11 March 2014:

>>>>
>>>>First of all, thank you very much for your explanations and clarifications! 
>>>>It is very interesting and useful!
>>>>
>>>>
>>>>Let me ask a few more questions and provide a few comments.
>>>>
>>>>
>>>>> Hi All, these questions and answers are very educating. Shall we add them 
>>>>> to our doc FAQs? 
>>>>
>>>>
>>>>I think it would be a very good idea to add something like this to the FAQs 
>>>>or to create some sort of an "architecture and implementation overview" 
>>>>document.
>>>>
>>>>
>>>>1) How many rules/queries can be defined in one engine. How does it affect 
>>>>performance?
>>>>>
>>>>>   For example, can I define (tens of) thousands of queries using the same 
>>>>>(or multiple) instance of SiddhiManager? Would it make processing much 
>>>>>slower? Or is the speed not proportional to the number of queries? E.g. 
>>>>>when a new event arrives, does Siddhi test it in a linear fashion against 
>>>>>each query or does Siddhi keep an internal state machine that tries to 
>>>>>match an event against all rules at once?
>>>>>
>>>>
>>>>
>>>>> SiddhiManager can have many queries, and if you chain the queries in a 
>>>>> linear fashion then all those queries will be executed 
>>>>> one after the other and you might see some performance degradation, but 
>>>>> if you have them in parallel then there won't be 
>>>>> any issues.   
>>>>
>>>>
>>>>
>>>>Well, before I got this answer, I created a few test-cases to check 
>>>>experimentally how it behaves. I created a single instance of a 
>>>>SiddhiManager, added 10000 queries that all read from the same input 
>>>>stream, check if a specific attribute (namely, price) of an event is inside 
>>>>a given random interval ( [ price >= random_low and price <= random_high] ) 
>>>>and output randomly into one of 100 streams. Then I measured the time 
>>>>required to process 1000000 events using this setup. I also did exactly the 
>>>>same experiment with Esper.
>>>>
>>>>
>>>>My findings were that Siddhi is much slower than Esper in this setup. After 
>>>>looking into the internal implementations of both, I realized the reason. 
>>>>Siddhi processes all queries that read from the same input stream in a 
>>>>linear fashion, sequentially. Even if many of the queries have almost the 
>>>>same condition, no optimization attempts are done by Siddhi. Esper detects 
>>>>that many queries have a condition on the same variable and creates some 
>>>>sort of a decision tree. As a result, its running time is O(log N), whereas 
>>>>Siddhi needs O(N). 
>>>>
>>>>
>>>>I'm not saying that this test-case is very typical or important, but maybe 
>>>>Siddhi should try to analyze the complete set of queries and try to apply 
>>>>some optimizations, when it is possible? I.e. it is a bit of a global 
>>>>optimization applied. It could detect some common sub-expressions or 
>>>>sub-conditions in the queries and evaluate them only once, instead of doing 
>>>>it over and over again by evaluating each query separately.
>>>>
>>>>
>>>>After getting these first results, I changed the setup, so that each query 
>>>>uses one of many input streams (e.g. one of 300) instead of using the same 
>>>>one. This greatly improved the situation, because now the number of queries 
>>>>per input stream was much smaller and thus processing was way faster. But 
>>>>even in this setup it is still about 5-6 times slower than Esper in this 
>>>>situation.
>>>
>>>
>>> Could you share your test cases so that we can have a look? Yes, we have 
>>>not worked much with 1000s of queries, 
>>
>>
>>Yes, I could provide my testcases - the source code is actually pretty small. 
>> What is the best way to do it? Should I simply attach a ZIP file with my 
>>project or better create a small github project?
>
>
>Could you report a JIRA here https://wso2.org/jira/browse/CEP and attach it?
> 
>
>>
>>but likely it is something we can fix without much trouble. 
>>
>>
>>
>>
>>Sounds promising. 
>> 
>>
>>>>
>>>>
>>>>>2) Is it possible to easily disable/enable some queries?
>>>>>
>>>>>In my use-cases I have a lot of queries. Actually, I have a lot of tenants 
>>>>>and each tenant may have something like 10-100 queries. Rather often (e.g. 
>>>>>few times a day), tenants would like to disable/enable some of their 
>>>>>queries. What is a proper way to do it? Is it a costly operation, i.e. 
>>>>>does Siddhi need to perform a lot of processing to disable or enable a 
>>>>>query?
>>>>>Is it better to keep a dedicated SiddhiManager instance per tenant or is 
>>>>>it OK to have one SiddhiManager instance which handles all those tenants 
>>>>>with all their queries?
>>>>>
>>>>>
>>>>> The general norm is that you use a SiddhiManager per scenario, where 
>>>>> each scenario might contain one or more queries. 
>>>>> With this model it is easy for any tenant to add or remove a scenario, 
>>>>> and it will not affect other queries and tenants.
>>>>
>>>>
>>>>If I have tens of thousands of tenants, then having a dedicated 
>>>>SiddhiManager per tenant is probably not very practical or even possible, 
>>>>as it will get pretty heavyweight, I guess.  
>>>>
>>>>
>>>>Therefore, having the ability to enable/disable a query could be very 
>>>>practical. In fact, it could be probably implemented very easily. Imagine 
>>>>that each query object has a boolean flag that indicates if it is enabled 
>>>>or not. If the condition matches and before Siddhi tries to perform the 
>>>>insert, i.e. the action, it could check if the query is disabled. If it is 
>>>>disabled, no action (i.e. insert) is performed at all. Of course, there is 
>>>>still some overhead when matching the query. But maybe even this can be 
>>>>skipped if the query is disabled? I.e. conditions are immediately evaluated to 
>>>>"false" and thus never trigger?
>>>>
>>>>
>>>>BTW, Esper has this feature. You can disable/enable any query without 
>>>>removing  and later adding it again.
>>>My understanding is that SiddhiManager is not heavy, but I will let Suho answer. 
>>>
>>>
>>> 
>>>
>>>>
>>>>When it comes to Siddhi persistent stores, you write:
>>>>>It only stores the state information of the processing, e.g. the current 
>>>>>running Avg of the average calculation. This will be used when the server 
>>>>>recovers from a failure. 
>>>>
>>>>
>>>>
>>>>OK. I understand what it does now. BTW, does it also store any sliding 
>>>>windows as well so that failover may happen?
>>>Yes, it stores everything, so failover works. 
>>> 
>>>
>>>>
>>>>My further question is: How to support more dynamic scenarios, where the 
>>>>set of queries is not totally static? What if the set of rules changes a 
>>>>few times per hour/day/etc? Maybe it would also make sense to persist a 
>>>>set of queries that were deployed on a given SiddhiManager? This way a user 
>>>>doesn't need to perform any custom book-keeping for the set of queries. 
>>>>
>>>>
>>>>Yet another question about Siddhi:
>>>>Is it possible to express queries that work with absolute time or timers 
>>>>without providing a time inside events?  E.g. how can one express in the 
>>>>query something like: "time is between 9:30 AM and 10:00 AM"? Is it 
>>>>possible to work with timers in the query? Basically, I'd like to trigger 
>>>>certain actions at a specific time or on a regular basis (every N minutes) 
>>>>and I'm wondering how this can be expressed using Siddhi's query language.
>>>
>>>
>>>One trick I have used is to create a timer stream, send events to that 
>>>timer stream periodically, and write the query against that timer stream 
>>>to do what I need. We wanted to add a timer as a built-in concept, so that 
>>>you could just say "from Timer(10s)" to receive events every 10 seconds, 
>>>but I think it has not been added yet. 
>>> 
>>
>>
>>Ah, so it is a planned feature? Cool! 
>
>
>Yes 
> 
>>
>>>>
>>>>And my last question for now:
>>>>Is it possible to have nested structures in events, e.g. something like 
>>>>this: "select field1.field12[3].field1234 from ..."? It means that an event 
>>>>has a field called field1, which in turn has an array sub-field called 
>>>>field12, and each element of this array has a field field1234. Is it 
>>>>possible? Or does Siddhi assume a flat structure of events, i.e. each event 
>>>>can have only fields of basic types?
>>>
>>>
>>>No, we do not support nested structures within Siddhi; it assumes flat 
>>>events. E.g. we take XML and map it to a flat structure. 
>>
>>
>>OK. I understand. I'd say it covers 90% of all use-cases. But having support 
>>for nested structures (a-la Esper) could be interesting. And I think the 
>>implementation would be pretty straight forward. 
>
>
>We have input and output adaptors that let us map a tree (e.g. XML) to a 
>flat structure. Yes, there are still some scenarios this does not cover. 
> 
>
>>
>>BTW, a few questions somewhat related to this question:
>>- What if I need to handle events which have a few mandatory fields and all 
>>other fields are optional? "define stream" only allows for a fixed structure, 
>>AFAIK. Especially, because it is assumed to be mapped to an object array. But 
>>it could be interesting to allow mapping of events to key/value maps. With 
>>this representation it can be pretty easy to support events/streams with any 
>>number of fields. The mandatory ones can be described in "define stream" and 
>>others are basically accessed at run-time by means of a key lookup. The 
>>syntax could be:
>>  define stream MyType map (field1 string, field2 int, field3 float)
>>
>
>
>Siddhi does not support optional fields, we did this for performance actually. 
>
>
>- In some of my test-cases, I wanted to avoid using a single stream for all 
>tenants (because it is very slow - see my previous messages). So, I created 
>one stream per tenant (e.g. 300000). All such streams are structurally the 
>same, but have different names. I noticed that it consumes quite some memory, 
>because stream definitions are not shared, even though they are immutable as 
>far as I understand. Maybe it would be a good idea to share stream 
>definitions if they are the same? I.e. StreamDefinition has two fields: 
>"String name" and "StreamRepresentation streamRep". The representation part 
>could be shared by all streams with the same structure.  Even better idea 
>could be to allow custom types. Then you could do something like:
>>  define type MyType (field1 string, field2 int, field3 float)
>>
>>  define stream MyStream1 MyType
>>  define stream MyStream2 MyType
>>  define stream MyStream3 MyType
>>  define stream MyStream4 MyType
>>  ...
>>
>>
>>Plus, if custom types could be defined, one could allow using them in 
>>stream/type definitions, e.g.:
>>define type MySecondType (field1 string, field2 int, field3 float, field4 
>>MyType)
>>
>
>
>Sharing stream representations is a good idea, and I think it is not too hard 
>to do. Could you open a Jira?
>
>
>Now, a different question: As far as I understand, it is currently possible to 
>join only 2 streams at once. Is it a correct understanding? If this is the 
>case, I'd like to understand the reasons for this limitation. Is there a real 
>technical problem that makes joining of >=3 streams difficult or impossible? 
>Or is it a temporary problem? Some of the rules used in my use-cases require 
>inputs from 4-6 streams. Modeling it using multi-level 2-way joins is really 
>annoying. Having support for n-way joins would make my life much easier ;-) 
>
>
> We decided to keep it simple. Maybe we should add syntax for this that 
>internally does a multi-level join. This needs some work, though, and it 
>will take some time before we get to it. 
> 
>And BTW the current syntax for sequences  is a bit ... misleading, IMHO 
>(though it is a minor issue). When someone writes "from Stream1 as s1, Stream2 
>as s2, Stream3 as s3 ....", one usually expects that it means a join from all 
>those streams, because this is how SQL works and some other CEP engines as 
>well (e.g. Esper). But Siddhi treats it as a sequence of events, which is a 
>very different thing. Therefore I think that this syntax is a bit dangerous 
>for newcomers or those familiar with SQL and/or other CEP engines.
>
>
>Once the difference is known, it is pretty intuitive and powerful. We thought 
>it was much easier to think about it that way. But I see what you mean as well. 
> 
>
>>
>>One more thing I noticed while experimenting with Siddhi:
>>- Siddhi JARs are available from Maven central or WSO2 maven repos, which is 
>>very nice. But would it be possible to provide source jars as well (not only 
>>for siddhi, but also for all WSO2 projects)? Right now they are not available 
>>and I had to checkout the whole WSO2 repo to build Siddhi binary and source 
>>JARs. And this repo checkout is > 500 MB big, so that it takes a while ;-(
>>
>>
>>
>>
>>Thanks,
>>   Leo
>
>
>
>-- 
>
>============================
>Srinath Perera, Ph.D.
>  Director, Research, WSO2 Inc.
>  Visiting Faculty, University of Moratuwa
>  Member, Apache Software Foundation
>  Research Scientist, Lanka Software Foundation
>  Blog: http://srinathsview.blogspot.com/
>  Photos: http://www.flickr.com/photos/hemapani/
>   Phone: 0772360902
>
>

_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
