On Mon, Mar 10, 2014 at 11:19 AM, Leo Romanoff wrote:
>> 1) How many rules/queries can be defined in one engine? How does it affect
>> performance?
>>
>> For example, can I define (tens of) thousands of queries using the same
>> (or multiple) instance of SiddhiManager? Would it make processing much
>> slower? Or is the speed not proportional to the number of queries? E.g.
>> when a new event arrives, does Siddhi test it in a linear fashion against
>> each query, or does Siddhi keep an internal state machine that tries to
>> match an event against all rules at once?

> A SiddhiManager can have many queries. If you chain the queries in a linear
> fashion, then all those queries will be executed one after the other and
> you might see some performance degradation, but if you have them in
> parallel, then there won't be any issues.

Well, before I got this answer, I created a few test cases to check
experimentally how it behaves. I created a single instance of a SiddhiManager
and added 10000 queries that all read from the same input stream, check
whether a specific attribute (namely, price) of an event is inside a given
random interval ( price >= random_low and price <= random_high ), and output
randomly into one of 100 streams. Then I measured the time required to process
1000000 events using this setup. I also did exactly the same experiment with
Esper.

My findings were that Siddhi is much slower than Esper in this setup. After
looking into the internal implementations of both, I realized the reason.
Siddhi processes all queries that read from the same input stream in a linear
fashion, sequentially. Even if many of the queries have almost the same
condition, no optimization attempts are made by Siddhi. Esper detects that
many queries have a condition on the same attribute and creates some sort of
a decision tree. As a result, its running time is O(log N), whereas Siddhi
needs O(N).
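To make the cost model concrete, here is a standalone sketch in plain Java (deliberately not Siddhi or Esper code; all names are mine) of what the linear evaluation amounts to: with N independent interval filters on the same attribute, every incoming event forces N predicate evaluations.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Sketch of the benchmark's core cost: N interval filters on the same
// attribute, all evaluated linearly for every event, i.e. O(N) per event.
public class LinearFilterScan {

    // One query's filter: price >= low and price <= high.
    static final class IntervalQuery {
        final double low, high;
        IntervalQuery(double low, double high) { this.low = low; this.high = high; }
        boolean matches(double price) { return price >= low && price <= high; }
    }

    // Linear evaluation: every query is checked for every event.
    static int countMatches(List<IntervalQuery> queries, double price) {
        int matches = 0;
        for (IntervalQuery q : queries) {
            if (q.matches(price)) matches++;
        }
        return matches;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        List<IntervalQuery> queries = new ArrayList<>();
        for (int i = 0; i < 10_000; i++) {
            double a = rnd.nextDouble() * 100, b = rnd.nextDouble() * 100;
            queries.add(new IntervalQuery(Math.min(a, b), Math.max(a, b)));
        }
        // Each event costs 10,000 predicate evaluations here, and this work
        // repeats for every one of the 1,000,000 events in the benchmark.
        System.out.println(countMatches(queries, 50.0));
    }
}
```

An engine that instead indexes the shared attribute (e.g. with an interval tree over the [low, high] bounds) could answer the same question in roughly O(log N + k) per event, which is presumably what Esper's decision tree achieves.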
I'm not saying that this test case is very typical or important, but maybe
Siddhi should try to analyze the complete set of queries and apply some
optimizations where possible, i.e. a bit of a global optimization. It could
detect common sub-expressions or sub-conditions in the queries and evaluate
them only once, instead of doing it over and over again by evaluating each
query separately.

After getting these first results, I changed the setup so that each query
uses one of many input streams (e.g. one of 300) instead of all using the
same one. This greatly improved the situation, because now the number of
queries per input stream was much smaller and thus processing was much
faster. But even in this setup Siddhi is still about 5-6 times slower than
Esper.

I'd like to get a bit more specific on this point. For the sake of
simplicity, let's say I need to model a lot of sensors (e.g. 100000 or
1000000). All sensors produce the same events, e.g. SensorEvent(id string,
value float), where id is the unique id of a sensor. For some/all of the
sensors there are a few queries (e.g. 2-10) that analyze events from a single
sensor or from multiple sensors. Obviously, to refer only to events from
specific sensors, each such query uses one or more filters like
SensorEvent(id=SensorN) to get only the expected events. Now imagine that I
have 10000 or even 100000 such queries in total (for all my sensors).
Processing with Siddhi gets very slow in this case, because all events are
put into the same event stream, and this event stream has a huge number of
listeners, i.e. queries reading from it. Currently, Siddhi goes over each
query in a linear fashion and checks its conditions. There are some
workarounds, as I described above, e.g. allocating one event stream per
sensor, pre-filtering events received from sensors by hand, and putting them
into the related event stream.
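The "evaluate the common sub-expression once and dispatch" optimization I have in mind could look roughly like the following plain-Java sketch (again, not Siddhi's actual API; SensorEvent, register and send are names I made up for illustration). The engine evaluates the shared equality condition on id exactly once per event and hands the event only to the small set of queries registered for that id, instead of testing every query's filter linearly:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Illustrative sketch of dispatching on a shared equality condition:
// the common sub-expression (the event's id) is evaluated once per event,
// and a hash lookup replaces the linear scan over all N query filters.
public class IdDispatchSketch {

    public static final class SensorEvent {
        public final String id;
        public final double value;
        public SensorEvent(String id, double value) { this.id = id; this.value = value; }
    }

    // id -> queries interested in that id; per-event lookup is O(1)
    // in the total number of queries, instead of O(N).
    private final Map<String, List<Consumer<SensorEvent>>> queriesById = new HashMap<>();

    public void register(String sensorId, Consumer<SensorEvent> query) {
        queriesById.computeIfAbsent(sensorId, k -> new ArrayList<>()).add(query);
    }

    public void send(SensorEvent event) {
        // The id filter is evaluated exactly once, regardless of query count.
        for (Consumer<SensorEvent> query :
                queriesById.getOrDefault(event.id, List.of())) {
            query.accept(event);
        }
    }

    public static void main(String[] args) {
        IdDispatchSketch engine = new IdDispatchSketch();
        List<String> hits = new ArrayList<>();
        engine.register("Sensor1", e -> hits.add("S1:" + e.value));
        engine.register("Sensor2", e -> hits.add("S2:" + e.value));
        engine.send(new SensorEvent("Sensor1", 42.0));
        engine.send(new SensorEvent("Sensor3", 7.0)); // no query registered: dropped
        System.out.println(hits); // only Sensor1's query fired
    }
}
```

For non-equality filters, a search-tree index over the filter constants would play the same role as the HashMap here; either way, the point is that the per-event cost depends on the number of *matching* queries, not the total number of queries.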
But this quickly gets annoying, because the whole idea of CEP is to delegate
this kind of optimization/decision to the CEP engine and avoid manual event
processing. I see different alternatives to solve it in a proper way:

- One alternative was described above already and is pretty generic: Siddhi
analyzes all queries and figures out that certain conditions are (almost) the
same. Therefore it can evaluate the condition only once (e.g. SensorEvent.id)
and then dispatch based on its value. Maybe some sort of a search tree could
be used to find the set of queries with a matching filter (Esper seems to do
something like this). I have filed an issue for this already.

- Yet another alternative I had in mind is to do something very similar to
"partition by". In principle, "partition by" can already effectively split
the input stream into partitions. The only problem is that exactly the _same_
query (or queries) is applied to each partition, whereas I need a small,
partition-specific set of queries to be applied to each partition. It feels
like it could be possible to extend/adapt "partition by" to achieve this, or
to implement something along the lines of "partition by", but I don't know
Siddhi's implementation well enough to judge whether it is feasible at all
and how much effort it would need.

Questions:

- Are there more efficient ways to model the "huge number of sensors"
scenario that I described above with the existing Siddhi implementation and
without doing part of the event processing by hand?

- What do you think about the "partition by"-like alternative that I
presented? Does it make sense? Can it be easily implemented?

Thanks,
Leo

_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
