Re: [Architecture] Using Siddhi Event processor to implement/evaluate some clustering algorithms

Lahiru Gunathilake Sat, 28 Jun 2014 01:54:07 -0700

Hi Mohan,

I have attached the latest patch for cep-877(I removed the old patch from the 
jira). But the very first patch I have attached in CEP-873 is required by this 
patch.


I made few improvements to both the algorithms where you can give the 
attributes you want to count. Initial version I did was to count distinct 
tuples but practically I think counting distinct attributes is going to be 
useful. If user doesn't give any attribute I simply count the distinct tuple 
with all the attributes.

I have added two test cases but I can see in the build cluster test cases are 
removed, I did test locally only with my test cases and worked fine.

If you have any issues with the two patches please let me know.

Thanks
Lahiru 
On Jun 27, 2014, at 2:03 AM, Seshika Fernando wrote:

> Hi Lahiru,
> 
> As Srinath has mentioned as well, frequency counting algorithms are very 
> useful in financial market scenarios as well (especially fraud detection and 
> surveillance). 
> Thanks for doing this and I will take a look too.
> 
> seshika
> 
> 
> On Fri, Jun 27, 2014 at 10:52 AM, Mohanadarshan Vivekanandalingam 
> <[email protected]> wrote:
> 
> 
> 
> On Fri, Jun 27, 2014 at 10:00 AM, Lahiru Gunathilake <[email protected]> 
> wrote:
> Hi Mohan,
> 
> Hi Lahiru,
>  
> 
> I wrote some samples but I can write more test-cases and provide another 
> patch. Please feel free to change the naming of the windows as you like. 
> 
> Really appreciate your contribution.. Sure, I'll start look into this..
> 
> Thanks,
> Mohan
> 
> 
> Regards
> Lahiru
> 
> On Jun 26, 2014, at 11:35 PM, Srinath Perera wrote:
> 
>> Hi All, 
>> 
>> Lahiru and myself had a call today morning. 
>> 
>> Plan is to
>> 1) Lahiru to look at hoeffding tree and other classification algorithms and 
>> select one to implement. He will compare the performance against MOA or some 
>> other implementation. 
>> 2) then we will use it for a  Fraud analysis scenario as a proof of its 
>> validity. 
>> 
>> Then we will decide how to continue from that point. 
>> 
>> Mohan could you look at the patch? Lahiru will write test cases that you can 
>> use to verify. 
>> 
>> --Srinath
>> 
>> 
>> On Tue, Jun 24, 2014 at 4:56 AM, Lahiru Gunathilake <[email protected]> 
>> wrote:
>> Hi Srinath,
>> 
>> Thanks for the response. 
>> On Jun 23, 2014, at 10:48 PM, Srinath Perera wrote:
>> 
>>> Hi Lahiru, 
>>> 
>>> Sorry for not responding earlier. I was traveling last week. 
>>> 
>>> I guess you know frequent item set !=  clustering algorithms. 
>> +1
>> 
>>> 
>>> Can we have a chat sometime to discuss details about the implementation. 
>> Yes sure, that would be very useful.
>> 
>> Thanks
>> Lahiru
>> 
>>> 
>>> --Srinath
>>> 
>>> 
>>> On Sun, Jun 22, 2014 at 6:25 PM, Lahiru Gunathilake <[email protected]> 
>>> wrote:
>>> Hi All,
>>> 
>>> I have implemented another frequency counting algorithm[1] which is a 
>>> classic algorithm for mining frequent items in a data stream. Basically 
>>> users can specify a minimum average value of frequent items with an error 
>>> value. 
>>> 
>>> This algorithm will accept two user-specified parameters: a support 
>>> threshold s [0-1] and an error parameter e [0-1] such that e << s and 
>>> recommended number for e is normally s/10 or s/20. Let N denote the current 
>>> length
>>> of the stream, i.e., the number of tuples seen so far. At any point of 
>>> time, this algorithm can be asked to produce a list of events along with 
>>> their estimated frequencies. The answers produced by this algorithm will 
>>> have the
>>> following guarantees:
>>> 1. All item(set)s whose true frequency exceeds sN are outputs.
>>>  There are no false negatives .
>>> 2. No events whose true frequency is less than (s - e) N 
>>> is output.
>>>  3. Estimated frequencies are less  than the true frequencies
>>> by at most eN.
>>>  .
>>> The incoming stream is conceptually divided in to buckets of width w = 
>>> ceiling(1/e) transactions each. Buckets are labeled with bucket ids , 
>>> starting from 1.We denote the current bucket id  by bcurrent whose value is 
>>> ceiling(N/w). For an element , we denote its true frequency in the stream 
>>> seen so far by "fe"  . Note that e and w are fixed for a data stream while 
>>> N, bcurrent and fe are the variables whose value changes when the stream 
>>> progress.
>>> Here the data structure ,  is a set of entries are tuples with the form of 
>>> (event, f, delta), where event is the actual event and "f" is the  is an 
>>> integer representing its estimated frequency, and delta is the maximum 
>>> possible error in "fe".
>>> 
>>>  In this algorithm every new event will anyways added in to the 
>>> data-structure optimistically and there is no initial condition to enter in 
>>> to the data-structure but iteratively if certain events are not match to 
>>> the condition provided
>>> by the user those events will be removed when N%windowSize = 0, during this 
>>> I output those events as expired-events in the window. For the incoming 
>>> events if event is already exists I output those as current-events and if 
>>> its a new
>>> event I just add it to the data-structure and iterate through all the 
>>> events available and find the matching events based on s and e values and 
>>> only those events will be output.
>>> 
>>> I am not sure based on the window definition this is the correct approach 
>>> or may be windows aren't the best way to associate this algorithm in to 
>>> siddhi. I have attached my patch to jira[2] and if you can look that would 
>>> be great.
>>>     
>>> Siddhi Query will looks like below,
>>> from  cseEventStream#window.lossyFrequent(0.1,0.01) " +
>>>                                                        "select symbol, 
>>> price " +
>>>                                                        "insert into 
>>> StockQuote;
>>> 
>>> 
>>> [1]Gurmeet Singh Manku and Rajeev Motwani. 2002. Approximate frequency 
>>> counts over data streams. In Proceedings of the 28th international 
>>> conference on Very Large Data Bases (VLDB '02). VLDB Endowment 346-357.
>>> 
>>> Regards
>>> Lahiru
>>> On Jun 19, 2014, at 2:36 AM, Lahiru Gunathilake wrote:
>>> 
>>>> Hi All,
>>>> On Jun 17, 2014, at 1:49 AM, Lahiru Gunathilake wrote:
>>>> 
>>>>> Hi All,
>>>>> 
>>>>> I am planning to evaluate different event stream clustering algorithms as 
>>>>> part of my studies(I am a graduate student at indiana University). I 
>>>>> think Siddhi is a good place to experiment this, As per my understanding 
>>>>> based on the docs Siddhi doesn't have a stream clustering interface I can 
>>>>> use directly to plug my own algorithm. So I am thinking of first come up 
>>>>> an interface for different clustering algorithms and add implementation 
>>>>> of algorithms for each event stream by invoking an operation like 
>>>>> SiddhiManager.addQuery. Or I can make the algorithm configure as part of 
>>>>> query language. If the second option is more consistent with current 
>>>>> model I can wrap-up the work in that way but initially focussing on first 
>>>>> approach will be easier for me. So each algorithm can be associated to a 
>>>>> desired event Stream or can be associated globally. If its associated 
>>>>> with each stream algorithm will run local to each stream otherwise it 
>>>>> will run in global context. Based on the algorithm I can provide a way to 
>>>>> configure it with parameters.
>>>>> 
>>>> I am sure I have confused with above implementation details, after looking 
>>>> in to Siddhi extension points I figured out I just have to implement a new 
>>>> window type. I have implemented one algorithm to keep the most frequent 
>>>> events 
>>>> came in a event stream. So queries can looks like below,
>>>> 
>>>> from  cseEventStream#window.frequent(2) " +
>>>>                                                        "select symbol, 
>>>> price " +
>>>>                                                        "insert into 
>>>> StockQuote;
>>>> 
>>>> There are multiple algorithms to keep the most frequent events in a given 
>>>> window size for now I just implemented a simple algorithm[1] with the 
>>>> processing complexity of O(1) and space complexity O(n) where n is the 
>>>> limit of the most frequent items. I have created a patch and attached it 
>>>> to jira[2].
>>>> 
>>>> [1]        Jayadev,        and     David   Gries   Misra,  "Finding        
>>>> repeated        elements,"      in      Science of      computer        
>>>> programming     2,      no.     
>>>> 2  (1982): 143-152.
>>>> [2]https://wso2.org/jira/browse/CEP-877
>>>> 
>>>> Thanks
>>>> Lahiru
>>>>> To start this I hope to implement a frequent item set mining algorithm 
>>>>> which can be used to find out most frequent items of an event stream. 
>>>>> Search engines use these kind of data to find out most frequent searches 
>>>>> in a given time window and optimize the search queries. I can start with 
>>>>> some algorithms like Misra-Gries algorithm[1] and Manku and        
>>>>> Motwani [2] and then move towards more of data clustering algorithms. For 
>>>>> the time being I will write the clustering results in to a file and later 
>>>>> I think I can use more stable storage (either wso2 registry or other 
>>>>> prefered way in wso2 product stack). If Siddhi or WSO2 CEP already have 
>>>>> the capability of frequent item mining I will start with a more 
>>>>> classification type algorithm.
>>>>> 
>>>>> Your feedback will be very useful for my work. If you have requirement 
>>>>> for any specific type of algorithms based on the real client interactions 
>>>>> you have, I would like to know them and implement them with Siddhi and do 
>>>>> the comparison.
>>>>> 
>>>>> Thanks
>>>>> Lahiru
>>>>> _______________________________________________
>>>>> Architecture mailing list
>>>>> [email protected]
>>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>>> 
>>>> _______________________________________________
>>>> Architecture mailing list
>>>> [email protected]
>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>> 
>>> 
>>> _______________________________________________
>>> Architecture mailing list
>>> [email protected]
>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>> 
>>> 
>>> 
>>> 
>>> -- 
>>> ============================
>>> Srinath Perera, Ph.D.
>>>    http://people.apache.org/~hemapani/
>>>    http://srinathsview.blogspot.com/
>>> _______________________________________________
>>> Architecture mailing list
>>> [email protected]
>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>> 
>> 
>> _______________________________________________
>> Architecture mailing list
>> [email protected]
>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>> 
>> 
>> 
>> 
>> -- 
>> ============================
>> Srinath Perera, Ph.D.
>>    http://people.apache.org/~hemapani/
>>    http://srinathsview.blogspot.com/
>> _______________________________________________
>> Architecture mailing list
>> [email protected]
>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
> 
> 
> 
> 
> -- 
> V. Mohanadarshan
> Software Engineer,
> Data Technologies Team,
> WSO2, Inc. http://wso2.com 
> lean.enterprise.middleware.
> 
> email: [email protected] 
> phone:(+94) 771117673
> 
> _______________________________________________
> Architecture mailing list
> [email protected]
> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
> 
>

_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture

Re: [Architecture] Using Siddhi Event processor to implement/evaluate some clustering algorithms

Reply via email to