Hi Lahiru, I have gone through above patches and play around with it little bit.. CEP873.patch seems fine and i have done some basic testing But i have got some issues when applying CEP877.patch. Even-though i am able to get rid off that after making some changes, I think it is better if you send proper patch for that.
Will commit these improvements once i got the patch for CEP877. Thanks, Mohan On Mon, Jun 30, 2014 at 10:27 AM, Mohanadarshan Vivekanandalingam < [email protected]> wrote: > > > On Sat, Jun 28, 2014 at 2:22 PM, Lahiru Gunathilake <[email protected]> > wrote: > >> Hi Mohan, >> >> I have attached the latest patch for cep-877(I removed the old patch from >> the jira). But the very first patch I have attached in CEP-873 is required >> by this patch. >> >> I made few improvements to both the algorithms where you can give the >> attributes you want to count. Initial version I did was to count distinct >> tuples but practically I think counting distinct attributes is going to be >> useful. If user doesn't give any attribute I simply count the distinct >> tuple with all the attributes. >> >> I have added two test cases but I can see in the build cluster test cases >> are removed, I did test locally only with my test cases and worked fine. >> >> If you have any issues with the two patches please let me know. >> >> > OK Lahiru, I'll go through that Today... > > Thanks, > Mohan > > >> Thanks >> Lahiru >> On Jun 27, 2014, at 2:03 AM, Seshika Fernando wrote: >> >> Hi Lahiru, >> >> As Srinath has mentioned as well, frequency counting algorithms are very >> useful in financial market scenarios as well (especially fraud detection >> and surveillance). >> Thanks for doing this and I will take a look too. >> >> seshika >> >> >> On Fri, Jun 27, 2014 at 10:52 AM, Mohanadarshan Vivekanandalingam < >> [email protected]> wrote: >> >>> >>> >>> >>> On Fri, Jun 27, 2014 at 10:00 AM, Lahiru Gunathilake < >>> [email protected]> wrote: >>> >>>> Hi Mohan, >>>> >>> >>> Hi Lahiru, >>> >>> >>>> >>>> I wrote some samples but I can write more test-cases and provide >>>> another patch. Please feel free to change the naming of the windows as you >>>> like. >>>> >>> >>> Really appreciate your contribution.. Sure, I'll start look into this.. >>> >>> Thanks, >>> Mohan >>> >>> >>>> Regards >>>> Lahiru >>>> >>>> On Jun 26, 2014, at 11:35 PM, Srinath Perera wrote: >>>> >>>> Hi All, >>>> >>>> Lahiru and myself had a call today morning. >>>> >>>> Plan is to >>>> 1) Lahiru to look at hoeffding tree and other classification algorithms >>>> and select one to implement. He will compare the performance against MOA or >>>> some other implementation. >>>> 2) then we will use it for a Fraud analysis scenario as a proof of its >>>> validity. >>>> >>>> Then we will decide how to continue from that point. >>>> >>>> Mohan could you look at the patch? Lahiru will write test cases that >>>> you can use to verify. >>>> >>>> --Srinath >>>> >>>> >>>> On Tue, Jun 24, 2014 at 4:56 AM, Lahiru Gunathilake < >>>> [email protected]> wrote: >>>> >>>>> Hi Srinath, >>>>> >>>>> Thanks for the response. >>>>> On Jun 23, 2014, at 10:48 PM, Srinath Perera wrote: >>>>> >>>>> Hi Lahiru, >>>>> >>>>> Sorry for not responding earlier. I was traveling last week. >>>>> >>>>> I guess you know frequent item set != clustering algorithms. >>>>> >>>>> +1 >>>>> >>>>> >>>>> Can we have a chat sometime to discuss details about the >>>>> implementation. >>>>> >>>>> Yes sure, that would be very useful. >>>>> >>>>> Thanks >>>>> Lahiru >>>>> >>>>> >>>>> --Srinath >>>>> >>>>> >>>>> On Sun, Jun 22, 2014 at 6:25 PM, Lahiru Gunathilake < >>>>> [email protected]> wrote: >>>>> >>>>>> Hi All, >>>>>> >>>>>> I have implemented another frequency counting algorithm[1] which is a >>>>>> classic algorithm for mining frequent items in a data stream. Basically >>>>>> users can specify a minimum average value of frequent items with an error >>>>>> value. >>>>>> >>>>>> This algorithm will accept two user-specified parameters: a support >>>>>> threshold s [0-1] and an error parameter e [0-1] such that e << s >>>>>> and recommended number for e is normally s/10 or s/20. Let N denote >>>>>> the current length >>>>>> of the stream, i.e., the number of tuples seen so far. At any point >>>>>> of time, this algorithm can be asked to produce a list of events >>>>>> along with their estimated frequencies. The answers produced by this >>>>>> algorithm will have the >>>>>> following guarantees: >>>>>> 1. All item(set)s whose true frequency exceeds sN are outputs. >>>>>> There are no false negatives . >>>>>> 2. No events whose true frequency is less than (s - e) N >>>>>> is output. >>>>>> 3. Estimated frequencies are less than the true frequencies >>>>>> by at most eN. >>>>>> . >>>>>> The incoming stream is conceptually divided in to buckets of width w >>>>>> = ceiling(1/e) transactions each. Buckets are labeled with bucket >>>>>> ids , starting from 1.We denote the current bucket id by bcurrent >>>>>> whose value is ceiling(N/w). For an element , we denote its true >>>>>> frequency in the stream seen so far by "fe" . Note that e and w are >>>>>> fixed for a data stream while N, bcurrent and fe are the variables whose >>>>>> value changes when the stream progress. >>>>>> Here the data structure , is a set of entries are tuples with the >>>>>> form of (event, f, delta), where event is the actual event and "f" is >>>>>> the is an integer representing its estimated frequency, and delta >>>>>> is the maximum possible error in "fe". >>>>>> >>>>>> In this algorithm every new event will anyways added in to the >>>>>> data-structure optimistically and there is no initial condition to enter >>>>>> in >>>>>> to the data-structure but iteratively if certain events are not match to >>>>>> the condition provided >>>>>> by the user those events will be removed when N%windowSize = 0, >>>>>> during this I output those events as expired-events in the window. For >>>>>> the >>>>>> incoming events if event is already exists I output those as >>>>>> current-events >>>>>> and if its a new >>>>>> event I just add it to the data-structure and iterate through all the >>>>>> events available and find the matching events based on s and e values and >>>>>> only those events will be output. >>>>>> >>>>>> I am not sure based on the window definition this is the correct >>>>>> approach or may be windows aren't the best way to associate this >>>>>> algorithm >>>>>> in to siddhi. I have attached my patch to jira[2] and if you can look >>>>>> that >>>>>> would be great. >>>>>> >>>>>> Siddhi Query will looks like below, >>>>>> from cseEventStream#window.lossyFrequent(0.1,0.01) " + >>>>>> "select >>>>>> symbol, price " + >>>>>> "insert into >>>>>> StockQuote; >>>>>> >>>>>> >>>>>> [1]Gurmeet Singh Manku and Rajeev Motwani. 2002. Approximate >>>>>> frequency counts over data streams. In Proceedings of the 28th >>>>>> international conference on Very Large Data Bases (VLDB '02). VLDB >>>>>> Endowment 346-357. >>>>>> >>>>>> Regards >>>>>> Lahiru >>>>>> On Jun 19, 2014, at 2:36 AM, Lahiru Gunathilake wrote: >>>>>> >>>>>> Hi All, >>>>>> On Jun 17, 2014, at 1:49 AM, Lahiru Gunathilake wrote: >>>>>> >>>>>> Hi All, >>>>>> >>>>>> I am planning to evaluate different event stream clustering >>>>>> algorithms as part of my studies(I am a graduate student at indiana >>>>>> University). I think Siddhi is a good place to experiment this, As per my >>>>>> understanding based on the docs Siddhi doesn't have a stream clustering >>>>>> interface I can use directly to plug my own algorithm. So I am thinking >>>>>> of >>>>>> first come up an interface for different clustering algorithms and add >>>>>> implementation of algorithms for each event stream by invoking an >>>>>> operation >>>>>> like SiddhiManager.addQuery. Or I can make the algorithm configure as >>>>>> part >>>>>> of query language. If the second option is more consistent with current >>>>>> model I can wrap-up the work in that way but initially focussing on first >>>>>> approach will be easier for me. So each algorithm can be associated to a >>>>>> desired event Stream or can be associated globally. If its associated >>>>>> with >>>>>> each stream algorithm will run local to each stream otherwise it will run >>>>>> in global context. Based on the algorithm I can provide a way to >>>>>> configure >>>>>> it with parameters. >>>>>> >>>>>> I am sure I have confused with above implementation details, after >>>>>> looking in to Siddhi extension points I figured out I just have to >>>>>> implement a new window type. I have implemented one algorithm to keep the >>>>>> most frequent events >>>>>> came in a event stream. So queries can looks like below, >>>>>> >>>>>> from cseEventStream#window.frequent(2) " + >>>>>> "select >>>>>> symbol, price " + >>>>>> "insert into >>>>>> StockQuote; >>>>>> >>>>>> There are multiple algorithms to keep the most frequent events in a >>>>>> given window size for now I just implemented a simple algorithm[1] with >>>>>> the >>>>>> processing complexity of O(1) and space complexity O(n) where n is the >>>>>> limit of the most frequent items. I have created a patch and attached it >>>>>> to >>>>>> jira[2]. >>>>>> >>>>>> [1] Jayadev, and David Gries Misra, "Finding repeated elements," in >>>>>> Science of computer programming 2, no. >>>>>> 2 (1982): 143-152. >>>>>> [2]https://wso2.org/jira/browse/CEP-877 >>>>>> >>>>>> >>>>>> Thanks >>>>>> Lahiru >>>>>> >>>>>> To start this I hope to implement a frequent item set mining >>>>>> algorithm which can be used to find out most frequent items of an event >>>>>> stream. Search engines use these kind of data to find out most frequent >>>>>> searches in a given time window and optimize the search queries. I can >>>>>> start with some algorithms like Misra-Gries algorithm[1] and Manku >>>>>> and Motwani [2] and then move towards more of data clustering >>>>>> algorithms. For the time being I will write the clustering results in to >>>>>> a >>>>>> file and later I think I can use more stable storage (either wso2 >>>>>> registry >>>>>> or other prefered way in wso2 product stack). If Siddhi or WSO2 CEP >>>>>> already >>>>>> have the capability of frequent item mining I will start with a more >>>>>> classification type algorithm. >>>>>> >>>>>> Your feedback will be very useful for my work. If you have >>>>>> requirement for any specific type of algorithms based on the real client >>>>>> interactions you have, I would like to know them and implement them with >>>>>> Siddhi and do the comparison. >>>>>> >>>>>> Thanks >>>>>> Lahiru >>>>>> _______________________________________________ >>>>>> Architecture mailing list >>>>>> [email protected] >>>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture >>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> Architecture mailing list >>>>>> [email protected] >>>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture >>>>>> >>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> Architecture mailing list >>>>>> [email protected] >>>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture >>>>>> >>>>>> >>>>> >>>>> >>>>> -- >>>>> ============================ >>>>> Srinath Perera, Ph.D. >>>>> http://people.apache.org/~hemapani/ >>>>> http://srinathsview.blogspot.com/ >>>>> _______________________________________________ >>>>> Architecture mailing list >>>>> [email protected] >>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture >>>>> >>>>> >>>>> >>>>> _______________________________________________ >>>>> Architecture mailing list >>>>> [email protected] >>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture >>>>> >>>>> >>>> >>>> >>>> -- >>>> ============================ >>>> Srinath Perera, Ph.D. >>>> http://people.apache.org/~hemapani/ >>>> http://srinathsview.blogspot.com/ >>>> _______________________________________________ >>>> Architecture mailing list >>>> [email protected] >>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture >>>> >>>> >>>> >>> >>> >>> -- >>> *V. Mohanadarshan* >>> *Software Engineer,* >>> *Data Technologies Team,* >>> *WSO2, Inc. http://wso2.com <http://wso2.com/> * >>> *lean.enterprise.middleware.* >>> >>> email: [email protected] >>> phone:(+94) 771117673 >>> >>> _______________________________________________ >>> Architecture mailing list >>> [email protected] >>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture >>> >>> >> >> >> _______________________________________________ >> Architecture mailing list >> [email protected] >> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture >> >> > > > -- > *V. Mohanadarshan* > *Software Engineer,* > *Data Technologies Team,* > *WSO2, Inc. http://wso2.com <http://wso2.com> * > *lean.enterprise.middleware.* > > email: [email protected] > phone:(+94) 771117673 > -- *V. Mohanadarshan* *Software Engineer,* *Data Technologies Team,* *WSO2, Inc. http://wso2.com <http://wso2.com> * *lean.enterprise.middleware.* email: [email protected] phone:(+94) 771117673
_______________________________________________ Architecture mailing list [email protected] https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
