Re: [Architecture] Using Siddhi Event processor to implement/evaluate some clustering algorithms

Lahiru Gunathilake Thu, 26 Jun 2014 21:31:26 -0700

Hi Mohan,

I wrote some samples but I can write more test-cases and provide another patch. 
Please feel free to change the naming of the windows as you like.


Regards
Lahiru
On Jun 26, 2014, at 11:35 PM, Srinath Perera wrote:

> Hi All, 
> 
> Lahiru and myself had a call today morning. 
> 
> Plan is to
> 1) Lahiru to look at hoeffding tree and other classification algorithms and 
> select one to implement. He will compare the performance against MOA or some 
> other implementation. 
> 2) then we will use it for a  Fraud analysis scenario as a proof of its 
> validity. 
> 
> Then we will decide how to continue from that point. 
> 
> Mohan could you look at the patch? Lahiru will write test cases that you can 
> use to verify. 
> 
> --Srinath
> 
> 
> On Tue, Jun 24, 2014 at 4:56 AM, Lahiru Gunathilake <[email protected]> 
> wrote:
> Hi Srinath,
> 
> Thanks for the response. 
> On Jun 23, 2014, at 10:48 PM, Srinath Perera wrote:
> 
>> Hi Lahiru, 
>> 
>> Sorry for not responding earlier. I was traveling last week. 
>> 
>> I guess you know frequent item set !=  clustering algorithms. 
> +1
> 
>> 
>> Can we have a chat sometime to discuss details about the implementation. 
> Yes sure, that would be very useful.
> 
> Thanks
> Lahiru
> 
>> 
>> --Srinath
>> 
>> 
>> On Sun, Jun 22, 2014 at 6:25 PM, Lahiru Gunathilake <[email protected]> 
>> wrote:
>> Hi All,
>> 
>> I have implemented another frequency counting algorithm[1] which is a 
>> classic algorithm for mining frequent items in a data stream. Basically 
>> users can specify a minimum average value of frequent items with an error 
>> value. 
>> 
>> This algorithm will accept two user-specified parameters: a support 
>> threshold s [0-1] and an error parameter e [0-1] such that e << s and 
>> recommended number for e is normally s/10 or s/20. Let N denote the current 
>> length
>> of the stream, i.e., the number of tuples seen so far. At any point of time, 
>> this algorithm can be asked to produce a list of events along with their 
>> estimated frequencies. The answers produced by this algorithm will have the
>> following guarantees:
>> 1. All item(set)s whose true frequency exceeds sN are outputs.
>>  There are no false negatives .
>> 2. No events whose true frequency is less than (s - e) N 
>> is output.
>>  3. Estimated frequencies are less  than the true frequencies
>> by at most eN.
>>  .
>> The incoming stream is conceptually divided in to buckets of width w = 
>> ceiling(1/e) transactions each. Buckets are labeled with bucket ids , 
>> starting from 1.We denote the current bucket id  by bcurrent whose value is 
>> ceiling(N/w). For an element , we denote its true frequency in the stream 
>> seen so far by "fe"  . Note that e and w are fixed for a data stream while 
>> N, bcurrent and fe are the variables whose value changes when the stream 
>> progress.
>> Here the data structure ,  is a set of entries are tuples with the form of 
>> (event, f, delta), where event is the actual event and "f" is the  is an 
>> integer representing its estimated frequency, and delta is the maximum 
>> possible error in "fe".
>> 
>>  In this algorithm every new event will anyways added in to the 
>> data-structure optimistically and there is no initial condition to enter in 
>> to the data-structure but iteratively if certain events are not match to the 
>> condition provided
>> by the user those events will be removed when N%windowSize = 0, during this 
>> I output those events as expired-events in the window. For the incoming 
>> events if event is already exists I output those as current-events and if 
>> its a new
>> event I just add it to the data-structure and iterate through all the events 
>> available and find the matching events based on s and e values and only 
>> those events will be output.
>> 
>> I am not sure based on the window definition this is the correct approach or 
>> may be windows aren't the best way to associate this algorithm in to siddhi. 
>> I have attached my patch to jira[2] and if you can look that would be great.
>>     
>> Siddhi Query will looks like below,
>> from  cseEventStream#window.lossyFrequent(0.1,0.01) " +
>>                                                        "select symbol, price 
>> " +
>>                                                        "insert into 
>> StockQuote;
>> 
>> 
>> [1]Gurmeet Singh Manku and Rajeev Motwani. 2002. Approximate frequency 
>> counts over data streams. In Proceedings of the 28th international 
>> conference on Very Large Data Bases (VLDB '02). VLDB Endowment 346-357.
>> 
>> Regards
>> Lahiru
>> On Jun 19, 2014, at 2:36 AM, Lahiru Gunathilake wrote:
>> 
>>> Hi All,
>>> On Jun 17, 2014, at 1:49 AM, Lahiru Gunathilake wrote:
>>> 
>>>> Hi All,
>>>> 
>>>> I am planning to evaluate different event stream clustering algorithms as 
>>>> part of my studies(I am a graduate student at indiana University). I think 
>>>> Siddhi is a good place to experiment this, As per my understanding based 
>>>> on the docs Siddhi doesn't have a stream clustering interface I can use 
>>>> directly to plug my own algorithm. So I am thinking of first come up an 
>>>> interface for different clustering algorithms and add implementation of 
>>>> algorithms for each event stream by invoking an operation like 
>>>> SiddhiManager.addQuery. Or I can make the algorithm configure as part of 
>>>> query language. If the second option is more consistent with current model 
>>>> I can wrap-up the work in that way but initially focussing on first 
>>>> approach will be easier for me. So each algorithm can be associated to a 
>>>> desired event Stream or can be associated globally. If its associated with 
>>>> each stream algorithm will run local to each stream otherwise it will run 
>>>> in global context. Based on the algorithm I can provide a way to configure 
>>>> it with parameters.
>>>> 
>>> I am sure I have confused with above implementation details, after looking 
>>> in to Siddhi extension points I figured out I just have to implement a new 
>>> window type. I have implemented one algorithm to keep the most frequent 
>>> events 
>>> came in a event stream. So queries can looks like below,
>>> 
>>> from  cseEventStream#window.frequent(2) " +
>>>                                                        "select symbol, 
>>> price " +
>>>                                                        "insert into 
>>> StockQuote;
>>> 
>>> There are multiple algorithms to keep the most frequent events in a given 
>>> window size for now I just implemented a simple algorithm[1] with the 
>>> processing complexity of O(1) and space complexity O(n) where n is the 
>>> limit of the most frequent items. I have created a patch and attached it to 
>>> jira[2].
>>> 
>>> [1] Jayadev,        and     David   Gries   Misra,  "Finding        
>>> repeated        elements,"      in      Science of      computer        
>>> programming     2,      no.     
>>> 2   (1982): 143-152.
>>> [2]https://wso2.org/jira/browse/CEP-877
>>> 
>>> Thanks
>>> Lahiru
>>>> To start this I hope to implement a frequent item set mining algorithm 
>>>> which can be used to find out most frequent items of an event stream. 
>>>> Search engines use these kind of data to find out most frequent searches 
>>>> in a given time window and optimize the search queries. I can start with 
>>>> some algorithms like Misra-Gries algorithm[1] and Manku and Motwani [2] 
>>>> and then move towards more of data clustering algorithms. For the time 
>>>> being I will write the clustering results in to a file and later I think I 
>>>> can use more stable storage (either wso2 registry or other prefered way in 
>>>> wso2 product stack). If Siddhi or WSO2 CEP already have the capability of 
>>>> frequent item mining I will start with a more classification type 
>>>> algorithm.
>>>> 
>>>> Your feedback will be very useful for my work. If you have requirement for 
>>>> any specific type of algorithms based on the real client interactions you 
>>>> have, I would like to know them and implement them with Siddhi and do the 
>>>> comparison.
>>>> 
>>>> Thanks
>>>> Lahiru
>>>> _______________________________________________
>>>> Architecture mailing list
>>>> [email protected]
>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>> 
>>> _______________________________________________
>>> Architecture mailing list
>>> [email protected]
>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>> 
>> 
>> _______________________________________________
>> Architecture mailing list
>> [email protected]
>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>> 
>> 
>> 
>> 
>> -- 
>> ============================
>> Srinath Perera, Ph.D.
>>    http://people.apache.org/~hemapani/
>>    http://srinathsview.blogspot.com/
>> _______________________________________________
>> Architecture mailing list
>> [email protected]
>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
> 
> 
> _______________________________________________
> Architecture mailing list
> [email protected]
> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
> 
> 
> 
> 
> -- 
> ============================
> Srinath Perera, Ph.D.
>    http://people.apache.org/~hemapani/
>    http://srinathsview.blogspot.com/
> _______________________________________________
> Architecture mailing list
> [email protected]
> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture

_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture

Re: [Architecture] Using Siddhi Event processor to implement/evaluate some clustering algorithms

Reply via email to