On Tue, Mar 11, 2014 at 4:10 PM, Leo Romanoff <[email protected]> wrote:
> Hi,
>
> As you requested, I created the following issues:
>
> https://wso2.org/jira/browse/CEP-709 - about sharing stream representations
>
> https://wso2.org/jira/browse/CEP-710 - about performance problems due to linear iteration over rules
>
> https://wso2.org/jira/browse/CEP-711 - provide source jars for Siddhi
>
> > Siddhi does not support optional fields, we did this for performance actually.
>
> I see your point. But is it really true that it improves performance? In any case, I suggest supporting maps or optional fields only if a user demands it. I.e. the current Object[] based approach is the default, and another representation is used only if a user explicitly asks for a map-based representation or optional fields.
>
> I could even imagine a mixture of both representations:
> - Object[] is still used for sending events
> - All mandatory fields (e.g. K fields) are the first K elements of this array.
> - All optional fields are put into a map which is passed as the last element of the array, i.e. it has index K.
> - If there are no optional elements allowed, there is no element at index K

+1 for this approach, we'll add this to the road map

Suho

> Best regards,
> -Leo
>
> Srinath Perera <[email protected]> wrote at 10:34 on Tuesday, 11 March 2014:
>
> First of all, thank you very much for your explanations and clarifications! It is very interesting and useful!
>
> Let me ask a few more questions and provide a few comments.
>
> > Hi All, these questions and answers are very educating. Shall we add them to our doc FAQs?
>
> I think it would be a very good idea to add something like this to the FAQs or to create some sort of an "architecture and implementation overview" document.
>
> 1) How many rules/queries can be defined in one engine? How does it affect performance?
>
> For example, can I define (tens of) thousands of queries using the same (or multiple) instance of SiddhiManager?
> Would it make processing much slower? Or is the speed not proportional to the number of queries? E.g. when a new event arrives, does Siddhi test it in a linear fashion against each query or does Siddhi keep an internal state machine that tries to match an event against all rules at once?
>
> > SiddhiManager can have many queries, and if you chain the queries in a linear fashion then all those queries will be executed one after the other and you might see some performance degradation, but if you have them in parallel then there won't be any issues.
>
> Well, before I got this answer, I created a few test-cases to check experimentally how it behaves. I created a single instance of a SiddhiManager and added 10000 queries that all read from the same input stream, check whether a specific attribute (namely, price) of an event is inside a given random interval ( [ price >= random_low and price <= random_high] ) and output randomly into one of 100 streams. Then I measured the time required to process 1000000 events using this setup. I also did exactly the same experiment with Esper.
>
> My findings were that Siddhi is much slower than Esper in this setup. After looking into the internal implementations of both, I realized the reason. Siddhi processes all queries that read from the same input stream in a linear fashion, sequentially. Even if many of the queries have almost the same condition, Siddhi makes no attempt at optimization. Esper detects that many queries have a condition on the same variable and creates some sort of a decision tree. As a result, its running time is O(log N), whereas Siddhi needs O(N).
>
> I'm not saying that this test-case is very typical or important, but maybe Siddhi should try to analyze the complete set of queries and apply some optimizations when possible? I.e. a kind of global optimization would be applied.
> It could detect some common sub-expressions or sub-conditions in the queries and evaluate them only once, instead of doing it over and over again by evaluating each query separately.
>
> After getting these first results, I changed the setup, so that each query uses one of many input streams (e.g. one of 300) instead of using the same one. This greatly improved the situation, because now the number of queries per input stream was much smaller and thus processing was way faster. But even in this setup it is still about 5-6 times slower than Esper.
>
> > Could you share your testcases, so we can have a look? Yes, we have not worked much with 1000s of queries,
>
> Yes, I could provide my testcases - the source code is actually pretty small. What is the best way to do it? Should I simply attach a ZIP file with my project or better create a small github project?
>
> > Could you report a JIRA here https://wso2.org/jira/browse/CEP and attach it?
>
> > but likely it is something we can fix without much trouble.
>
> Sounds promising.
>
> 2) Is it possible to easily disable/enable some queries?
>
> In my use-cases I have a lot of queries. Actually, I have a lot of tenants and each tenant may have something like 10-100 queries. Rather often (e.g. a few times a day), tenants would like to disable/enable some of their queries. What is a proper way to do it? Is it a costly operation, i.e. does Siddhi need to perform a lot of processing to disable or enable a query? Is it better to keep a dedicated SiddhiManager instance per tenant or is it OK to have one SiddhiManager instance which handles all those tenants with all their queries?
>
> > The general norm is that you use a SiddhiManager per scenario, where each scenario might contain one or more queries. With this model it is easy for any tenant to add or remove a scenario, and it will not affect other queries and tenants.
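The gap described above between testing every range filter in turn and looking the answer up in a tree can be illustrated with a small sketch. It is not how Siddhi or Esper is implemented; the class `RangeFilterIndex` and its methods are hypothetical names, and it assumes every query is a closed interval check on a single numeric attribute (`price >= low and price <= high`), as in the test setup from the thread. The filter boundaries are sorted once, the set of matching filters is precomputed for each boundary and for each gap between neighbouring boundaries, and an event then costs one ordered-map lookup.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical index over filters of the form low <= price <= high. */
final class RangeFilterIndex {
    // Filters matching a price exactly equal to a boundary value.
    private final TreeMap<Double, BitSet> atBoundary = new TreeMap<>();
    // Filters matching every price strictly between a boundary and the next one.
    private final TreeMap<Double, BitSet> afterBoundary = new TreeMap<>();

    RangeFilterIndex(double[] lows, double[] highs) {
        TreeSet<Double> distinct = new TreeSet<>();
        for (int q = 0; q < lows.length; q++) {
            distinct.add(lows[q]);
            distinct.add(highs[q]);
        }
        List<Double> bounds = new ArrayList<>(distinct);
        for (int i = 0; i < bounds.size(); i++) {
            double b = bounds.get(i);
            BitSet at = new BitSet();
            BitSet after = new BitSet();
            for (int q = 0; q < lows.length; q++) {
                if (lows[q] <= b && b <= highs[q]) {
                    at.set(q);
                }
                // A filter covers the open gap (b, next) iff it spans both ends.
                if (i + 1 < bounds.size() && lows[q] <= b && bounds.get(i + 1) <= highs[q]) {
                    after.set(q);
                }
            }
            atBoundary.put(b, at);
            afterBoundary.put(b, after);
        }
    }

    /** Ids of all matching filters via one tree lookup. Callers must not mutate the result. */
    BitSet match(double price) {
        BitSet exact = atBoundary.get(price);
        if (exact != null) {
            return exact;
        }
        Double floor = afterBoundary.floorKey(price);
        return floor == null ? new BitSet() : afterBoundary.get(floor);
    }

    /** Baseline for comparison: test every filter in turn, O(N) per event. */
    static BitSet matchLinear(double[] lows, double[] highs, double price) {
        BitSet hits = new BitSet();
        for (int q = 0; q < lows.length; q++) {
            if (price >= lows[q] && price <= highs[q]) {
                hits.set(q);
            }
        }
        return hits;
    }
}
```

The sketch trades memory and build time for lookup speed: it stores one bit set per boundary and per gap, so a real engine would more likely use an interval tree or a similar structure, and would have to rebuild or update the index whenever queries are added or removed.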
>
> If I have tens of thousands of tenants, then having a dedicated SiddhiManager per tenant is probably not very practical or even possible, as it will get pretty heavyweight, I guess.
>
> Therefore, having the ability to enable/disable a query could be very practical. In fact, it could probably be implemented very easily. Imagine that each query object has a boolean flag that indicates whether it is enabled. If the condition matches, then before Siddhi performs the insert, i.e. the action, it could check whether the query is disabled. If it is disabled, no action (i.e. insert) is performed at all. Of course, there is still some overhead when matching the query. But maybe even this can be skipped if the query is disabled? I.e. conditions are immediately evaluated to "false" and thus never trigger?
>
> BTW, Esper has this feature. You can disable/enable any query without removing and later adding it again.
>
> > My understanding is that SiddhiManager is not heavy, but I will let Suho answer.
>
> When it comes to Siddhi persistent stores, you write:
> > It only stores the state information of the processing, e.g. the current running Avg of the average calculation. This will be used when the server recovers from a failure.
>
> OK. I understand what it does now. BTW, does it also store any sliding windows so that failover can happen?
>
> > Yes, it stores everything, so failover works.
>
> My further question is: How to support more dynamic scenarios, where the set of queries is not totally static? What if the set of rules changes a few times per hour/day/etc? Maybe it would also make sense to persist the set of queries that were deployed on a given SiddhiManager? This way a user doesn't need to perform any custom book-keeping for the set of queries.
>
> Yet another question about Siddhi:
> Is it possible to express queries that work with absolute time or timers without providing a time inside events? E.g.
> how can one express in the query something like: "time is between 9:30 AM and 10:00 AM"? Is it possible to work with timers in the query? Basically, I'd like to trigger certain actions at a specific time or on a regular basis (every N minutes) and I'm wondering how this can be expressed using Siddhi's query language.
>
> > One trick I have used is to create a timer stream, send events to that timer stream periodically, and write the query using that timer stream to do what I need. We wanted to add the timer as an inbuilt concept, so you just say from Timer(10s) to receive events every 10 secs, but it is not yet added, I think.
>
> Ah, so it is a planned feature? Cool!
>
> > Yes
>
> And my last question for now:
> Is it possible to have nested structures in events, e.g. something like this: "select field1.field12[3].field1234 from ..."? It means that an event has a field called field1, which in turn has an array sub-field called field12, and each element of this array has a field field1234. Is it possible? Or does Siddhi assume a flat structure of events, i.e. each event can have only fields of basic types?
>
> > No, we do not do nested structures within Siddhi; it assumes flat events. E.g. XML we take and map to a flat structure.
>
> OK. I understand. I'd say it covers 90% of all use-cases. But having support for nested structures (a-la Esper) could be interesting. And I think the implementation would be pretty straightforward.
>
> > We have input and output adaptors that let us map a tree (e.g. like XML) to a flat structure. Yes, there are still some scenarios it does not cover.
>
> BTW, a few questions somewhat related to this question:
> - What if I need to handle events which have a few mandatory fields and all other fields are optional? "define stream" only allows for a fixed structure, AFAIK. Especially, because it is assumed to be mapped to an object array.
> But it could be interesting to allow mapping of events to key/value maps. With this representation it can be pretty easy to support events/streams with any number of fields. The mandatory ones can be described in "define stream" and the others are basically accessed at run-time by means of a key lookup. The syntax could be:
> define stream MyType map (field1 string, field2 int, field3 float)
>
> > Siddhi does not support optional fields, we did this for performance actually.
>
> - In some of my test-cases, I wanted to avoid using a single stream for all tenants (because it is very slow - see my previous messages). So, I created one stream per tenant (e.g. 300000). All such streams are structurally the same, but have different names. I noticed that this consumes quite some memory, because stream definitions are not shared, even though they are immutable as far as I understand. Maybe it would be a good idea to share stream definitions if they are the same? I.e. StreamDefinition has two fields: "String name" and "StreamRepresentation streamRep". The representation part could be shared by all streams with the same structure. An even better idea could be to allow custom types. Then you could do something like:
> define type MyType (field1 string, field2 int, field3 float)
> define stream MyStream1 MyType
> define stream MyStream2 MyType
> define stream MyStream3 MyType
> define stream MyStream4 MyType
> ...
>
> Plus, if custom types could be defined, one could allow using them in stream/type definitions, e.g.:
> define type MySecondType (field1 string, field2 int, field3 float, field4 MyType)
>
> > Sharing stream representations is a good idea, and I think it is not too hard to do. Could you open a Jira?
>
> Now, a different question: As far as I understand, it is currently possible to join only 2 streams at once. Is that a correct understanding? If this is the case, I'd like to understand the reasons for this limitation.
> Is there a real technical problem that makes joining of >=3 streams difficult or impossible? Or is it a temporary problem? Some of the rules used in my use-cases require inputs from 4-6 streams. Modeling it using multi-level 2-way joins is really annoying. Having support for n-way joins would make my life much easier ;-)
>
> > We decided to keep it simple. Maybe we should add syntax for this that internally does a multi-level join. This, though, needs some work, and it will take some time before we get to it.
>
> And BTW the current syntax for sequences is a bit ... misleading, IMHO (though it is a minor issue). When someone writes "from Stream1 as s1, Stream2 as s2, Stream3 as s3 ....", one usually expects that it means a join of all those streams, because this is how SQL works and some other CEP engines as well (e.g. Esper). But Siddhi treats it as a sequence of events, which is a very different thing. Therefore I think that this syntax is a bit dangerous for newcomers or those familiar with SQL and/or other CEP engines.
>
> > Once the difference is known, it is pretty intuitive and powerful. We thought it was much easier to think about it that way. But I see what you mean as well.
>
> One more thing I noticed while experimenting with Siddhi:
> - Siddhi JARs are available from Maven central or WSO2 maven repos, which is very nice. But would it be possible to provide source jars as well (not only for Siddhi, but also for all WSO2 projects)? Right now they are not available and I had to check out the whole WSO2 repo to build Siddhi binary and source JARs. And this repo checkout is > 500 MB big, so it takes a while ;-(
>
> Thanks,
> Leo
>
> --
> ============================
> Srinath Perera, Ph.D.
> Director, Research, WSO2 Inc.
> Visiting Faculty, University of Moratuwa
> Member, Apache Software Foundation
> Research Scientist, Lanka Software Foundation
> Blog: http://srinathsview.blogspot.com/
> Photos: http://www.flickr.com/photos/hemapani/
> Phone: 0772360902
>
> _______________________________________________
> Architecture mailing list
> [email protected]
> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture

--
S. Suhothayan
Associate Technical Lead,
WSO2 Inc. http://wso2.com
lean . enterprise . middleware

cell: (+94) 779 756 757 | blog: http://suhothayan.blogspot.com/
twitter: http://twitter.com/suhothayan | linked-in: http://lk.linkedin.com/in/suhothayan
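For reference, the mixed event layout proposed at the top of this thread (K mandatory fields first, an optional-field map at index K, and no element at index K when optional fields are not allowed) might look roughly like the sketch below. This is only an illustration under stated assumptions, not Siddhi code: `HybridEvent`, `pack` and `optionalField` are hypothetical names, and passing `null` for the map stands for a stream that allows no optional fields.

```java
import java.util.Arrays;
import java.util.Map;

/** Hypothetical helper for the mixed Object[] plus map event layout. */
final class HybridEvent {
    private HybridEvent() {}

    /**
     * Builds the event array: the K mandatory values first, then the
     * optional-field map at index K. With a null map the array keeps
     * exactly K elements, so plain events pay no extra cost.
     */
    static Object[] pack(Object[] mandatory, Map<String, Object> optional) {
        if (optional == null) {
            return mandatory.clone();
        }
        Object[] data = Arrays.copyOf(mandatory, mandatory.length + 1);
        data[mandatory.length] = optional;
        return data;
    }

    /** Looks up an optional field by name; returns null if the event has no map or no such key. */
    @SuppressWarnings("unchecked")
    static Object optionalField(Object[] data, int k, String name) {
        if (data.length <= k) {
            return null; // no element at index K: no optional fields
        }
        return ((Map<String, Object>) data[k]).get(name);
    }
}
```

Mandatory attributes keep their fixed array positions, so access to them stays a plain index lookup; only optional attributes go through a hash lookup.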
