Hi Mohan, I wrote some samples but I can write more test-cases and provide another patch. Please feel free to change the naming of the windows as you like.
Regards Lahiru On Jun 26, 2014, at 11:35 PM, Srinath Perera wrote: > Hi All, > > Lahiru and myself had a call today morning. > > Plan is to > 1) Lahiru to look at hoeffding tree and other classification algorithms and > select one to implement. He will compare the performance against MOA or some > other implementation. > 2) then we will use it for a Fraud analysis scenario as a proof of its > validity. > > Then we will decide how to continue from that point. > > Mohan could you look at the patch? Lahiru will write test cases that you can > use to verify. > > --Srinath > > > On Tue, Jun 24, 2014 at 4:56 AM, Lahiru Gunathilake <[email protected]> > wrote: > Hi Srinath, > > Thanks for the response. > On Jun 23, 2014, at 10:48 PM, Srinath Perera wrote: > >> Hi Lahiru, >> >> Sorry for not responding earlier. I was traveling last week. >> >> I guess you know frequent item set != clustering algorithms. > +1 > >> >> Can we have a chat sometime to discuss details about the implementation. > Yes sure, that would be very useful. > > Thanks > Lahiru > >> >> --Srinath >> >> >> On Sun, Jun 22, 2014 at 6:25 PM, Lahiru Gunathilake <[email protected]> >> wrote: >> Hi All, >> >> I have implemented another frequency counting algorithm[1] which is a >> classic algorithm for mining frequent items in a data stream. Basically >> users can specify a minimum average value of frequent items with an error >> value. >> >> This algorithm will accept two user-specified parameters: a support >> threshold s [0-1] and an error parameter e [0-1] such that e << s and >> recommended number for e is normally s/10 or s/20. Let N denote the current >> length >> of the stream, i.e., the number of tuples seen so far. At any point of time, >> this algorithm can be asked to produce a list of events along with their >> estimated frequencies. The answers produced by this algorithm will have the >> following guarantees: >> 1. All item(set)s whose true frequency exceeds sN are outputs. >> There are no false negatives . >> 2. No events whose true frequency is less than (s - e) N >> is output. >> 3. Estimated frequencies are less than the true frequencies >> by at most eN. >> . >> The incoming stream is conceptually divided in to buckets of width w = >> ceiling(1/e) transactions each. Buckets are labeled with bucket ids , >> starting from 1.We denote the current bucket id by bcurrent whose value is >> ceiling(N/w). For an element , we denote its true frequency in the stream >> seen so far by "fe" . Note that e and w are fixed for a data stream while >> N, bcurrent and fe are the variables whose value changes when the stream >> progress. >> Here the data structure , is a set of entries are tuples with the form of >> (event, f, delta), where event is the actual event and "f" is the is an >> integer representing its estimated frequency, and delta is the maximum >> possible error in "fe". >> >> In this algorithm every new event will anyways added in to the >> data-structure optimistically and there is no initial condition to enter in >> to the data-structure but iteratively if certain events are not match to the >> condition provided >> by the user those events will be removed when N%windowSize = 0, during this >> I output those events as expired-events in the window. For the incoming >> events if event is already exists I output those as current-events and if >> its a new >> event I just add it to the data-structure and iterate through all the events >> available and find the matching events based on s and e values and only >> those events will be output. >> >> I am not sure based on the window definition this is the correct approach or >> may be windows aren't the best way to associate this algorithm in to siddhi. >> I have attached my patch to jira[2] and if you can look that would be great. >> >> Siddhi Query will looks like below, >> from cseEventStream#window.lossyFrequent(0.1,0.01) " + >> "select symbol, price >> " + >> "insert into >> StockQuote; >> >> >> [1]Gurmeet Singh Manku and Rajeev Motwani. 2002. Approximate frequency >> counts over data streams. In Proceedings of the 28th international >> conference on Very Large Data Bases (VLDB '02). VLDB Endowment 346-357. >> >> Regards >> Lahiru >> On Jun 19, 2014, at 2:36 AM, Lahiru Gunathilake wrote: >> >>> Hi All, >>> On Jun 17, 2014, at 1:49 AM, Lahiru Gunathilake wrote: >>> >>>> Hi All, >>>> >>>> I am planning to evaluate different event stream clustering algorithms as >>>> part of my studies(I am a graduate student at indiana University). I think >>>> Siddhi is a good place to experiment this, As per my understanding based >>>> on the docs Siddhi doesn't have a stream clustering interface I can use >>>> directly to plug my own algorithm. So I am thinking of first come up an >>>> interface for different clustering algorithms and add implementation of >>>> algorithms for each event stream by invoking an operation like >>>> SiddhiManager.addQuery. Or I can make the algorithm configure as part of >>>> query language. If the second option is more consistent with current model >>>> I can wrap-up the work in that way but initially focussing on first >>>> approach will be easier for me. So each algorithm can be associated to a >>>> desired event Stream or can be associated globally. If its associated with >>>> each stream algorithm will run local to each stream otherwise it will run >>>> in global context. Based on the algorithm I can provide a way to configure >>>> it with parameters. >>>> >>> I am sure I have confused with above implementation details, after looking >>> in to Siddhi extension points I figured out I just have to implement a new >>> window type. I have implemented one algorithm to keep the most frequent >>> events >>> came in a event stream. So queries can looks like below, >>> >>> from cseEventStream#window.frequent(2) " + >>> "select symbol, >>> price " + >>> "insert into >>> StockQuote; >>> >>> There are multiple algorithms to keep the most frequent events in a given >>> window size for now I just implemented a simple algorithm[1] with the >>> processing complexity of O(1) and space complexity O(n) where n is the >>> limit of the most frequent items. I have created a patch and attached it to >>> jira[2]. >>> >>> [1] Jayadev, and David Gries Misra, "Finding >>> repeated elements," in Science of computer >>> programming 2, no. >>> 2 (1982): 143-152. >>> [2]https://wso2.org/jira/browse/CEP-877 >>> >>> Thanks >>> Lahiru >>>> To start this I hope to implement a frequent item set mining algorithm >>>> which can be used to find out most frequent items of an event stream. >>>> Search engines use these kind of data to find out most frequent searches >>>> in a given time window and optimize the search queries. I can start with >>>> some algorithms like Misra-Gries algorithm[1] and Manku and Motwani [2] >>>> and then move towards more of data clustering algorithms. For the time >>>> being I will write the clustering results in to a file and later I think I >>>> can use more stable storage (either wso2 registry or other prefered way in >>>> wso2 product stack). If Siddhi or WSO2 CEP already have the capability of >>>> frequent item mining I will start with a more classification type >>>> algorithm. >>>> >>>> Your feedback will be very useful for my work. If you have requirement for >>>> any specific type of algorithms based on the real client interactions you >>>> have, I would like to know them and implement them with Siddhi and do the >>>> comparison. >>>> >>>> Thanks >>>> Lahiru >>>> _______________________________________________ >>>> Architecture mailing list >>>> [email protected] >>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture >>> >>> _______________________________________________ >>> Architecture mailing list >>> [email protected] >>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture >> >> >> _______________________________________________ >> Architecture mailing list >> [email protected] >> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture >> >> >> >> >> -- >> ============================ >> Srinath Perera, Ph.D. >> http://people.apache.org/~hemapani/ >> http://srinathsview.blogspot.com/ >> _______________________________________________ >> Architecture mailing list >> [email protected] >> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture > > > _______________________________________________ > Architecture mailing list > [email protected] > https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture > > > > > -- > ============================ > Srinath Perera, Ph.D. > http://people.apache.org/~hemapani/ > http://srinathsview.blogspot.com/ > _______________________________________________ > Architecture mailing list > [email protected] > https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
_______________________________________________ Architecture mailing list [email protected] https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
