On Sat, Jun 28, 2014 at 2:22 PM, Lahiru Gunathilake <[email protected]> wrote:
> Hi Mohan, > > I have attached the latest patch for cep-877(I removed the old patch from > the jira). But the very first patch I have attached in CEP-873 is required > by this patch. > > I made few improvements to both the algorithms where you can give the > attributes you want to count. Initial version I did was to count distinct > tuples but practically I think counting distinct attributes is going to be > useful. If user doesn't give any attribute I simply count the distinct > tuple with all the attributes. > > I have added two test cases but I can see in the build cluster test cases > are removed, I did test locally only with my test cases and worked fine. > > If you have any issues with the two patches please let me know. > > OK Lahiru, I'll go through that Today... Thanks, Mohan > Thanks > Lahiru > On Jun 27, 2014, at 2:03 AM, Seshika Fernando wrote: > > Hi Lahiru, > > As Srinath has mentioned as well, frequency counting algorithms are very > useful in financial market scenarios as well (especially fraud detection > and surveillance). > Thanks for doing this and I will take a look too. > > seshika > > > On Fri, Jun 27, 2014 at 10:52 AM, Mohanadarshan Vivekanandalingam < > [email protected]> wrote: > >> >> >> >> On Fri, Jun 27, 2014 at 10:00 AM, Lahiru Gunathilake < >> [email protected]> wrote: >> >>> Hi Mohan, >>> >> >> Hi Lahiru, >> >> >>> >>> I wrote some samples but I can write more test-cases and provide another >>> patch. Please feel free to change the naming of the windows as you like. >>> >> >> Really appreciate your contribution.. Sure, I'll start look into this.. >> >> Thanks, >> Mohan >> >> >>> Regards >>> Lahiru >>> >>> On Jun 26, 2014, at 11:35 PM, Srinath Perera wrote: >>> >>> Hi All, >>> >>> Lahiru and myself had a call today morning. >>> >>> Plan is to >>> 1) Lahiru to look at hoeffding tree and other classification algorithms >>> and select one to implement. He will compare the performance against MOA or >>> some other implementation. >>> 2) then we will use it for a Fraud analysis scenario as a proof of its >>> validity. >>> >>> Then we will decide how to continue from that point. >>> >>> Mohan could you look at the patch? Lahiru will write test cases that you >>> can use to verify. >>> >>> --Srinath >>> >>> >>> On Tue, Jun 24, 2014 at 4:56 AM, Lahiru Gunathilake < >>> [email protected]> wrote: >>> >>>> Hi Srinath, >>>> >>>> Thanks for the response. >>>> On Jun 23, 2014, at 10:48 PM, Srinath Perera wrote: >>>> >>>> Hi Lahiru, >>>> >>>> Sorry for not responding earlier. I was traveling last week. >>>> >>>> I guess you know frequent item set != clustering algorithms. >>>> >>>> +1 >>>> >>>> >>>> Can we have a chat sometime to discuss details about the >>>> implementation. >>>> >>>> Yes sure, that would be very useful. >>>> >>>> Thanks >>>> Lahiru >>>> >>>> >>>> --Srinath >>>> >>>> >>>> On Sun, Jun 22, 2014 at 6:25 PM, Lahiru Gunathilake < >>>> [email protected]> wrote: >>>> >>>>> Hi All, >>>>> >>>>> I have implemented another frequency counting algorithm[1] which is a >>>>> classic algorithm for mining frequent items in a data stream. Basically >>>>> users can specify a minimum average value of frequent items with an error >>>>> value. >>>>> >>>>> This algorithm will accept two user-specified parameters: a support >>>>> threshold s [0-1] and an error parameter e [0-1] such that e << s and >>>>> recommended number for e is normally s/10 or s/20. Let N denote the >>>>> current length >>>>> of the stream, i.e., the number of tuples seen so far. At any point >>>>> of time, this algorithm can be asked to produce a list of events >>>>> along with their estimated frequencies. The answers produced by this >>>>> algorithm will have the >>>>> following guarantees: >>>>> 1. All item(set)s whose true frequency exceeds sN are outputs. >>>>> There are no false negatives . >>>>> 2. No events whose true frequency is less than (s - e) N >>>>> is output. >>>>> 3. Estimated frequencies are less than the true frequencies >>>>> by at most eN. >>>>> . >>>>> The incoming stream is conceptually divided in to buckets of width w = >>>>> ceiling(1/e) transactions each. Buckets are labeled with bucket ids , >>>>> starting from 1.We denote the current bucket id by bcurrent whose >>>>> value is ceiling(N/w). For an element , we denote its true frequency >>>>> in the stream seen so far by "fe" . Note that e and w are fixed for >>>>> a data stream while N, bcurrent and fe are the variables whose value >>>>> changes when the stream progress. >>>>> Here the data structure , is a set of entries are tuples with the >>>>> form of (event, f, delta), where event is the actual event and "f" is >>>>> the is an integer representing its estimated frequency, and delta is >>>>> the maximum possible error in "fe". >>>>> >>>>> In this algorithm every new event will anyways added in to the >>>>> data-structure optimistically and there is no initial condition to enter >>>>> in >>>>> to the data-structure but iteratively if certain events are not match to >>>>> the condition provided >>>>> by the user those events will be removed when N%windowSize = 0, during >>>>> this I output those events as expired-events in the window. For the >>>>> incoming events if event is already exists I output those as >>>>> current-events >>>>> and if its a new >>>>> event I just add it to the data-structure and iterate through all the >>>>> events available and find the matching events based on s and e values and >>>>> only those events will be output. >>>>> >>>>> I am not sure based on the window definition this is the correct >>>>> approach or may be windows aren't the best way to associate this algorithm >>>>> in to siddhi. I have attached my patch to jira[2] and if you can look that >>>>> would be great. >>>>> >>>>> Siddhi Query will looks like below, >>>>> from cseEventStream#window.lossyFrequent(0.1,0.01) " + >>>>> "select symbol, >>>>> price " + >>>>> "insert into >>>>> StockQuote; >>>>> >>>>> >>>>> [1]Gurmeet Singh Manku and Rajeev Motwani. 2002. Approximate >>>>> frequency counts over data streams. In Proceedings of the 28th >>>>> international conference on Very Large Data Bases (VLDB '02). VLDB >>>>> Endowment 346-357. >>>>> >>>>> Regards >>>>> Lahiru >>>>> On Jun 19, 2014, at 2:36 AM, Lahiru Gunathilake wrote: >>>>> >>>>> Hi All, >>>>> On Jun 17, 2014, at 1:49 AM, Lahiru Gunathilake wrote: >>>>> >>>>> Hi All, >>>>> >>>>> I am planning to evaluate different event stream clustering algorithms >>>>> as part of my studies(I am a graduate student at indiana University). I >>>>> think Siddhi is a good place to experiment this, As per my understanding >>>>> based on the docs Siddhi doesn't have a stream clustering interface I can >>>>> use directly to plug my own algorithm. So I am thinking of first come up >>>>> an >>>>> interface for different clustering algorithms and add implementation of >>>>> algorithms for each event stream by invoking an operation like >>>>> SiddhiManager.addQuery. Or I can make the algorithm configure as part of >>>>> query language. If the second option is more consistent with current model >>>>> I can wrap-up the work in that way but initially focussing on first >>>>> approach will be easier for me. So each algorithm can be associated to a >>>>> desired event Stream or can be associated globally. If its associated with >>>>> each stream algorithm will run local to each stream otherwise it will run >>>>> in global context. Based on the algorithm I can provide a way to configure >>>>> it with parameters. >>>>> >>>>> I am sure I have confused with above implementation details, after >>>>> looking in to Siddhi extension points I figured out I just have to >>>>> implement a new window type. I have implemented one algorithm to keep the >>>>> most frequent events >>>>> came in a event stream. So queries can looks like below, >>>>> >>>>> from cseEventStream#window.frequent(2) " + >>>>> "select symbol, >>>>> price " + >>>>> "insert into >>>>> StockQuote; >>>>> >>>>> There are multiple algorithms to keep the most frequent events in a >>>>> given window size for now I just implemented a simple algorithm[1] with >>>>> the >>>>> processing complexity of O(1) and space complexity O(n) where n is the >>>>> limit of the most frequent items. I have created a patch and attached it >>>>> to >>>>> jira[2]. >>>>> >>>>> [1] Jayadev, and David Gries Misra, "Finding repeated elements," in >>>>> Science of computer programming 2, no. >>>>> 2 (1982): 143-152. >>>>> [2]https://wso2.org/jira/browse/CEP-877 >>>>> >>>>> >>>>> Thanks >>>>> Lahiru >>>>> >>>>> To start this I hope to implement a frequent item set mining algorithm >>>>> which can be used to find out most frequent items of an event stream. >>>>> Search engines use these kind of data to find out most frequent searches >>>>> in >>>>> a given time window and optimize the search queries. I can start with some >>>>> algorithms like Misra-Gries algorithm[1] and Manku and Motwani [2] >>>>> and then move towards more of data clustering algorithms. For the time >>>>> being I will write the clustering results in to a file and later I think I >>>>> can use more stable storage (either wso2 registry or other prefered way in >>>>> wso2 product stack). If Siddhi or WSO2 CEP already have the capability of >>>>> frequent item mining I will start with a more classification type >>>>> algorithm. >>>>> >>>>> Your feedback will be very useful for my work. If you have requirement >>>>> for any specific type of algorithms based on the real client interactions >>>>> you have, I would like to know them and implement them with Siddhi and do >>>>> the comparison. >>>>> >>>>> Thanks >>>>> Lahiru >>>>> _______________________________________________ >>>>> Architecture mailing list >>>>> [email protected] >>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture >>>>> >>>>> >>>>> _______________________________________________ >>>>> Architecture mailing list >>>>> [email protected] >>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture >>>>> >>>>> >>>>> >>>>> _______________________________________________ >>>>> Architecture mailing list >>>>> [email protected] >>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture >>>>> >>>>> >>>> >>>> >>>> -- >>>> ============================ >>>> Srinath Perera, Ph.D. >>>> http://people.apache.org/~hemapani/ >>>> http://srinathsview.blogspot.com/ >>>> _______________________________________________ >>>> Architecture mailing list >>>> [email protected] >>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture >>>> >>>> >>>> >>>> _______________________________________________ >>>> Architecture mailing list >>>> [email protected] >>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture >>>> >>>> >>> >>> >>> -- >>> ============================ >>> Srinath Perera, Ph.D. >>> http://people.apache.org/~hemapani/ >>> http://srinathsview.blogspot.com/ >>> _______________________________________________ >>> Architecture mailing list >>> [email protected] >>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture >>> >>> >>> >> >> >> -- >> *V. Mohanadarshan* >> *Software Engineer,* >> *Data Technologies Team,* >> *WSO2, Inc. http://wso2.com <http://wso2.com/> * >> *lean.enterprise.middleware.* >> >> email: [email protected] >> phone:(+94) 771117673 >> >> _______________________________________________ >> Architecture mailing list >> [email protected] >> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture >> >> > > > _______________________________________________ > Architecture mailing list > [email protected] > https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture > > -- *V. Mohanadarshan* *Software Engineer,* *Data Technologies Team,* *WSO2, Inc. http://wso2.com <http://wso2.com> * *lean.enterprise.middleware.* email: [email protected] phone:(+94) 771117673
_______________________________________________ Architecture mailing list [email protected] https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
