Re: [Architecture] A few questions about WSO2 CEP/Siddhi

Sriskandarajah Suhothayan Tue, 25 Mar 2014 06:45:52 -0700

On Tue, Mar 11, 2014 at 4:10 PM, Leo Romanoff <[email protected]> wrote:


> Hi,
>
> As you requested, I created the following issues:
>
> https://wso2.org/jira/browse/CEP-709 - about sharing stream
> representations
>
> https://wso2.org/jira/browse/CEP-710 - about performance problems due to
> linear iteration over rules
>
> https://wso2.org/jira/browse/CEP-711 - provide source jars for Siddhi
>
> > Siddhi does not support optional fields, we did this for performance
> actually.
>
> I see your point. But is it really true that it improves performance? And
> after all, I suggest supporting maps or optional fields only if a user
> demands it. I.e. current Object[] based approach is the default and only if
> a user explicitly asks for map-based representation or optional fields,
> then another representation is used.
>
> I could even imagine a mixture of both representations:
> - Object[] is still used for sending events
> - All mandatory fields (e.g. K fields) are the first K elements of this
> array.
> - All optional fields are put into a map which is passed as the last
> element of the array, i.e. it has index K.
> - If there are no optional elements allowed, there is no element at index K
>
+1 for this approach; we'll add this to the roadmap.

Suho

> Best regards,
>   -Leo
>
>   Srinath Perera <[email protected]> wrote at 10:34 on Tuesday, 11 March
> 2014:
>
>
> First of all, thank you very much for your explanations and
> clarifications! It is very interesting and useful!
>
> Let me ask a few more questions and provide a few comments.
>
> > Hi all, these questions and answers are very educational. Shall we add
> them to our doc FAQs?
>
> I think it would be a very good idea to add something like this to the
> FAQs or to create some sort of an "architecture and implementation
> overview" document.
>
> 1) How many rules/queries can be defined in one engine? How does it affect
> performance?
>
>    For example, can I define (tens of) thousands of queries using the same
> (or multiple) instance of SiddhiManager? Would it make processing much
> slower? Or is the speed not proportional to the number of queries? E.g.
> when a new event arrives, does Siddhi test it in a linear fashion against
> each query or does Siddhi keep an internal state machine that tries to
> match an event against all rules at once?
>
>
> > SiddhiManager can have many queries. If you chain the queries in a
> > linear fashion, then all those queries will be executed
> > one after the other and you might see some performance degradation, but
> > if you have them in parallel then there won't be
> > any issues.
>
> Well, before I got this answer, I created a few test-cases to check
> experimentally how it behaves. I created a single instance of a
> SiddhiManager, added 10000 queries that all read from the same input
> stream, check if a specific attribute (namely, price) of an event is inside
> a given random interval ( [ price >= random_low and price <= random_high] )
> and output randomly into one of 100 streams. Then I measured the time
> required to process 1000000 events using this setup. I also did exactly the
> same experiment with Esper.
>
> My findings were that Siddhi is much slower than Esper in this setup.
> After looking into the internal implementations of both, I realized the
> reason. Siddhi processes all queries that read from the same input stream
> in a linear fashion, sequentially. Even if many of the queries have almost
> the same condition, no optimization attempts are made by Siddhi. Esper
> detects that many queries have a condition on the same variable and creates
> some sort of decision tree. As a result, its running time is O(log N),
> whereas Siddhi needs O(N).
>
> I'm not saying that this test case is very typical or important, but
> maybe Siddhi should try to analyze the complete set of queries and try to
> apply some optimizations when possible? I.e. it is a bit of a global
> optimization applied. It could detect some common sub-expressions or
> sub-conditions in the queries and evaluate them only once, instead of doing
> it over and over again by evaluating each query separately.
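
The difference between the two evaluation strategies can be sketched in plain Java. This is only an illustration under stated assumptions: `RangeQueryIndex` and `Range` are hypothetical names, and the index is a simple sort-by-lower-bound structure, not the decision tree Esper builds and not Siddhi code. It prunes every query whose lower bound is above the price via binary search, but its worst case is still linear; an interval tree would be needed for a logarithmic bound.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical index over conditions of the form [price >= low and price <= high]. */
final class RangeQueryIndex {
    /** One range condition plus the id of the query that owns it. */
    static final class Range {
        final int queryId;
        final double low;
        final double high;

        Range(int queryId, double low, double high) {
            this.queryId = queryId;
            this.low = low;
            this.high = high;
        }
    }

    private final Range[] byLow; // ranges sorted by lower bound

    RangeQueryIndex(List<Range> ranges) {
        byLow = ranges.toArray(new Range[0]);
        Arrays.sort(byLow, Comparator.comparingDouble((Range r) -> r.low));
    }

    /** Baseline: test every query against the event, one after the other. */
    static List<Integer> matchLinear(List<Range> ranges, double price) {
        List<Integer> hits = new ArrayList<>();
        for (Range r : ranges) {
            if (price >= r.low && price <= r.high) {
                hits.add(r.queryId);
            }
        }
        return hits;
    }

    /** Binary-search the first range with low > price, then check only the prefix. */
    List<Integer> match(double price) {
        int lo = 0;
        int hi = byLow.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (byLow[mid].low <= price) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < lo; i++) { // every range here has low <= price
            if (byLow[i].high >= price) {
                hits.add(byLow[i].queryId);
            }
        }
        return hits;
    }
}
```

Both methods return the same set of query ids; only the amount of work per event differs.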
>
> After getting these first results, I changed the setup, so that each query
> uses one of many input streams (e.g. one of 300) instead of using the same
> one. This greatly improved the situation, because now the number of queries
> per input stream was much smaller and thus processing was way faster. But
> even in this setup it is still about 5-6 times slower than Esper in this
> situation.
>
>
>  Could you share your test cases so we can have a look? Yes, we have not
> worked much with thousands of queries,
>
>
> Yes, I could provide my testcases - the source code is actually pretty
> small.  What is the best way to do it? Should I simply attach a ZIP file
> with my project or better create a small github project?
>
>
> Could you report a JIRA here https://wso2.org/jira/browse/CEP and attach
> it?
>
>
>
> but likely it is something we can fix without much trouble.
>
>
>
> Sounds promising.
>
>
>
>
> 2) Is it possible to easily disable/enable some queries?
>
> In my use-cases I have a lot of queries. Actually, I have a lot of tenants
> and each tenant may have something like 10-100 queries. Rather often (e.g.
> a few times a day), tenants would like to disable/enable some of their
> queries. What is a proper way to do it? Is it a costly operation, i.e. does
> Siddhi need to perform a lot of processing to disable or enable a query?
> Is it better to keep a dedicated SiddhiManager instance per tenant or is
> it OK to have one SiddhiManager instance which handles all those tenants
> with all their queries?
>
> > The general norm is that you use a SiddhiManager per scenario, where
> > each scenario might contain one or more queries.
> > With this model it's easy for any tenant to add or remove a scenario,
> > and it will not affect other queries and tenants.
>
> If I have tens of thousands of tenants, then having a dedicated
> SiddhiManager per tenant is probably not very practical or even possible,
> as it will get pretty heavyweight, I guess.
>
> Therefore, having the ability to enable/disable a query could be very
> practical. In fact, it could probably be implemented very easily. Imagine
> that each query object has a boolean flag that indicates if it is enabled
> or not. If the condition matches and before Siddhi tries to perform the
> insert, i.e. the action, it could check if the query is disabled. If it is
> disabled, no action (i.e. insert) is performed at all. Of course, there is
> still some overhead when matching the query. But maybe even this can be
> skipped if query is disabled? I.e. conditions are immediately evaluated to
> "false" and thus never trigger?
>
> BTW, Esper has this feature. You can disable/enable any query without
> removing  and later adding it again.
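
The flag-based idea described above could look roughly like this in Java. `ToggleableQuery` is a hypothetical wrapper used for illustration; it is not Siddhi's query class, and the condition and action are modelled as plain `Predicate` and `Consumer` objects.

```java
import java.util.function.Consumer;
import java.util.function.Predicate;

/** Hypothetical query wrapper with an enable flag; not Siddhi's actual query class. */
final class ToggleableQuery {
    private final Predicate<Object[]> condition;
    private final Consumer<Object[]> action;
    // volatile so that a toggle from a management thread is visible to the processing thread
    private volatile boolean enabled = true;

    ToggleableQuery(Predicate<Object[]> condition, Consumer<Object[]> action) {
        this.condition = condition;
        this.action = action;
    }

    void setEnabled(boolean enabled) {
        this.enabled = enabled;
    }

    boolean isEnabled() {
        return enabled;
    }

    /** Returns true if the action fired. A disabled query skips even the condition check. */
    boolean process(Object[] event) {
        if (!enabled) {
            return false;
        }
        if (!condition.test(event)) {
            return false;
        }
        action.accept(event);
        return true;
    }
}
```

Checking the flag before the condition means a disabled query costs one volatile read per event, and toggling it requires no rebuilding of the query.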
>
> My understanding is that SiddhiManager is not heavy, but I will let Suho
> answer.
>
>
>
>
> When it comes to Siddhi persistent stores, you write:
> > It only stores the state information of the processing, e.g. the current
> > running average of the average calculation. This will be used when the
> > server recovers from a failure.
>
> OK. I understand what it does now. BTW, does it also store any sliding
> windows as well so that failover may happen?
>
> Yes, it stores everything, so failover works.
>
>
>
> My further question is: How to support more dynamic scenarios, where the
> set of queries is not totally static? What if the set of rules changes a
> few times per hour/day/etc? Maybe it would also make sense to persist a
> set of queries that were deployed on a given SiddhiManager? This way a user
> doesn't need to perform any custom book-keeping for the set of queries.
>
> Yet another question about Siddhi:
> Is it possible to express queries that work with absolute time or timers
> without providing a time inside events?  E.g. how can one express in the
> query something like: "time is between 9:30 AM and 10:00 AM"? Is it
> possible to work with timers in the query? Basically, I'd like to trigger
> certain actions at a specific time or on a regular basis (every N minutes)
> and I'm wondering how this can be expressed using Siddhi's query language.
>
>
> One trick I have used is to create a timer stream, send events to
> that timer stream periodically, and write the query against that
> timer stream to do what I need. We wanted to add a timer as a built-in
> concept, so you could just say "from Timer(10s)" to receive events every
> 10 seconds, but it has not been added yet, I think.
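
A rough Java sketch of the timer-stream trick, assuming the engine is reachable through some callback that accepts `Object[]` events. `TimerStreamSource` and the `sink` parameter are hypothetical; in a real setup the sink would be whatever hands events to the timer stream of the engine.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical timer source: periodically pushes a one-field event into a sink. */
final class TimerStreamSource implements AutoCloseable {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    /**
     * Sends an event containing the current time in milliseconds to the sink,
     * first immediately and then every periodMillis.
     * Note: if the sink throws, the executor cancels all further ticks.
     */
    void start(long periodMillis, Consumer<Object[]> sink) {
        scheduler.scheduleAtFixedRate(
                () -> sink.accept(new Object[]{System.currentTimeMillis()}),
                0, periodMillis, TimeUnit.MILLISECONDS);
    }

    /** Stops the timer thread. */
    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```

A query that should fire every N minutes, or only between 9:30 AM and 10:00 AM, can then be written against the timestamp attribute carried by these events.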
>
>
>
> Ah, so it is a planned feature? Cool!
>
>
> Yes
>
>
>
>
> And my last question for now:
> Is it possible to have nested structures in events, e.g. something like
> this: "select field1.field12[3].field1234 from ..."? It means that an
> event has a field called field1, which in turn has an array sub-field
> called field12, and each element of this array has a field field1234. Is it
> possible? Or does Siddhi assume a flat structure of events, i.e. each event
> can have only fields of basic types?
>
>
> No, we do not support nested structures within Siddhi; it assumes flat
> events. For example, we take XML and map it to a flat structure.
>
>
> OK. I understand. I'd say it covers 90% of all use-cases. But having
> support for nested structures (à la Esper) could be interesting. And I
> think the implementation would be pretty straightforward.
>
>
> We have input and output adaptors that let us map a tree (e.g. XML) to a
> flat structure. Yes, there are still some scenarios this does not cover.
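
To illustrate what mapping a tree to a flat structure means, here is a small Java sketch that flattens nested maps and lists into path-style keys such as `field1.field12[3].field1234`. `EventFlattener` is a hypothetical helper; the actual WSO2 CEP adaptors are configured differently and are not shown here.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper that flattens nested maps/lists into a single-level map. */
final class EventFlattener {
    private EventFlattener() {}

    /** Returns a flat map whose keys are paths such as "a.b[0].c". */
    static Map<String, Object> flatten(Map<String, ?> nested) {
        Map<String, Object> flat = new LinkedHashMap<>();
        for (Map.Entry<String, ?> e : nested.entrySet()) {
            walk(e.getKey(), e.getValue(), flat);
        }
        return flat;
    }

    private static void walk(String path, Object value, Map<String, Object> out) {
        if (value instanceof Map) {
            // descend into a sub-structure: append ".key"
            for (Map.Entry<?, ?> e : ((Map<?, ?>) value).entrySet()) {
                walk(path + "." + e.getKey(), e.getValue(), out);
            }
        } else if (value instanceof List) {
            // descend into an array: append "[index]"
            List<?> list = (List<?>) value;
            for (int i = 0; i < list.size(); i++) {
                walk(path + "[" + i + "]", list.get(i), out);
            }
        } else {
            out.put(path, value); // leaf of a basic type
        }
    }
}
```

The flat keys can then be mapped one-to-one onto the attributes of a fixed stream definition, which is also why arrays of varying length are among the scenarios such a mapping does not cover well.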
>
>
>
> BTW, a few questions somewhat related to this question:
> - What if I need to handle events which have a few mandatory fields and
> all other fields are optional? "define stream" only allows for a fixed
> structure, AFAIK. Especially, because it is assumed to be mapped to an
> object array. But it could be interesting to allow mapping of events to
> key/value maps. With this representation it can be pretty easy to support
> events/streams with any number of fields. The mandatory ones can be
> described in "define stream" and others are basically accessed at run-time
> by means of a key lookup. The syntax could be:
>   define stream MyType map (field1 string, field2 int, field3 float)
>
>
> Siddhi does not support optional fields, we did this for performance
> actually.
>
> - In some of my test-cases, I wanted to avoid using a single stream for
> all tenants (because it is very slow - see my previous messages). So, I
> created one stream per tenant (e.g. 300000). All such streams are
> structurally the same, but have different names. I noticed that it consumes
> quite some memory, because stream definitions are not shared, even though
> they are immutable as far as I understand. Maybe it would be a good idea
> to share stream definitions if they are the same? I.e. StreamDefinition has
> two fields: "String name" and "StreamRepresentation streamRep". The
> representation part could be shared by all streams with the same structure.
> An even better idea could be to allow custom types. Then you could do
> something like:
>   define type MyType (field1 string, field2 int, field3 float)
>   define stream MyStream1 MyType
>   define stream MyStream2 MyType
>   define stream MyStream3 MyType
>   define stream MyStream4 MyType
>   ...
>
> Plus, if custom types could be defined, one could allow using them in
> stream/type definitions, e.g.:
> define type MySecondType (field1 string, field2 int, field3 float, field4
> MyType)
>
>
> Sharing stream representations is a good idea, and I think it is not too
> hard to do. Could you open a Jira?
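
One possible shape for sharing stream representations, sketched in Java under the assumption that a representation is just an immutable list of attribute declarations. `StreamRepresentation` and `SharedStreamDefinition` are hypothetical classes following the two-field split suggested above; they are not Siddhi's `StreamDefinition`.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical immutable attribute list, e.g. ["field1 string", "field2 int"]. */
final class StreamRepresentation {
    // one canonical instance per distinct attribute list
    private static final Map<List<String>, StreamRepresentation> POOL =
            new ConcurrentHashMap<>();

    final List<String> attributes;

    private StreamRepresentation(List<String> attributes) {
        this.attributes = attributes;
    }

    /** Returns the shared instance for this attribute list, creating it on first use. */
    static StreamRepresentation of(List<String> attributes) {
        List<String> key = Collections.unmodifiableList(new ArrayList<>(attributes));
        return POOL.computeIfAbsent(key, StreamRepresentation::new);
    }
}

/** Hypothetical stream definition: a unique name plus a shared representation. */
final class SharedStreamDefinition {
    final String name;
    final StreamRepresentation streamRep;

    SharedStreamDefinition(String name, List<String> attributes) {
        this.name = name;
        this.streamRep = StreamRepresentation.of(attributes);
    }
}
```

With this split, hundreds of thousands of structurally identical streams hold one representation object between them, and each stream only adds its name and a reference.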
>
> Now, a different question: As far as I understand, it is currently
> possible to join only 2 streams at once. Is it a correct understanding? If
> this is the case, I'd like to understand the reasons for this limitation.
> Is there a real technical problem that makes joining of >=3 streams
> difficult or impossible? Or is it a temporary problem? Some of the rules
> used in my use-cases require inputs from 4-6 streams. Modeling it using
> multi-level 2-way joins is really annoying. Having support for n-way joins
> would make my life much easier ;-)
>
>
> We decided to keep it simple. Maybe we should add syntax for this that
> internally does a multi-level join. This needs some work, though, and it
> will take some time before we get to it.
>
>
> And BTW the current syntax for sequences is a bit ... misleading, IMHO
> (though it is a minor issue). When someone writes "from Stream1 as s1,
> Stream2 as s2, Stream3 as s3 ....", one usually expects that it means a
> join from all those streams, because this is how SQL works and some other
> CEP engines as well (e.g. Esper). But Siddhi treats it as a sequence of
> events, which is a very different thing. Therefore I think that this syntax
> is a bit dangerous for newcomers or those familiar with SQL and/or other
> CEP engines.
>
>
> When the difference is known, it is pretty intuitive and powerful. We
> thought it is much easier to think about it that way. But I see what you
> mean as well.
>
>
>
> One more thing I noticed while experimenting with Siddhi:
> - Siddhi JARs are available from Maven central or WSO2 maven repos, which
> is very nice. But would it be possible to provide source jars as well (not
> only for siddhi, but also for all WSO2 projects)? Right now they are not
> available and I had to checkout the whole WSO2 repo to build Siddhi binary
> and source JARs. And this repo checkout is > 500 MB big, so that it takes a
> while ;-(
>
>
> Thanks,
>    Leo
>
>
>
>
> --
> ============================
> Srinath Perera, Ph.D.
>   Director, Research, WSO2 Inc.
>   Visiting Faculty, University of Moratuwa
>   Member, Apache Software Foundation
>   Research Scientist, Lanka Software Foundation
>   Blog: http://srinathsview.blogspot.com/
>   Photos: http://www.flickr.com/photos/hemapani/
>    Phone: 0772360902
>
>
>
> _______________________________________________
> Architecture mailing list
> [email protected]
> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>
>


-- 

S. Suhothayan
Associate Technical Lead,
WSO2 Inc. http://wso2.com
lean . enterprise . middleware

cell: (+94) 779 756 757 | blog: http://suhothayan.blogspot.com/
twitter: http://twitter.com/suhothayan | linked-in:
http://lk.linkedin.com/in/suhothayan

