Re: [Architecture] Using Siddhi Event processor to implement/evaluate some clustering algorithms

Lahiru Gunathilake Mon, 30 Jun 2014 10:47:54 -0700

Hi Mohan,

As I said please ignore the cep-877.patch and use the improved.cep-877. I 
cloned a new repo and ran following commands to apply the patches and worked 
fine.


149-160-172-65:siddhi lahirugunathilake$ git apply --check cep-873.patch 
149-160-172-65:siddhi lahirugunathilake$ git am --signoff < cep-873.patch 
Applying: fixing issue https://wso2.org/jira/browse/CEP-872
 149-160-172-65:siddhi lahirugunathilake$ git apply --check improved.cep-877 
149-160-172-65:siddhi lahirugunathilake$ git am --signoff <improved.cep-877 
Applying: adding Lossy Counting algorithm as a window implementation
Applying: improving the counting algorithm to count based on each attribute and 
adding test-cases for both algorithms
Applying: modifying lossy outputs not to emit tuples even they already exists 
but always have to match the frequency range
Applying: reverting the pom file changes
149-160-172-65:siddhi lahirugunathilake$



Regards
Lahiru
On Jun 30, 2014, at 7:31 AM, Mohanadarshan Vivekanandalingam wrote:

> Hi Lahiru,
> 
> I have gone through above patches and play around with it little bit..
> CEP873.patch seems fine and i have done some basic testing  But i have got 
> some issues when applying CEP877.patch. Even-though i am able to get rid off 
> that after making some changes, I think it is better if you send proper patch 
> for that.
> 
> Will commit these improvements once i got the patch for CEP877.  
> 
> Thanks,
> Mohan
>  
> 
> 
> On Mon, Jun 30, 2014 at 10:27 AM, Mohanadarshan Vivekanandalingam 
> <[email protected]> wrote:
> 
> 
> On Sat, Jun 28, 2014 at 2:22 PM, Lahiru Gunathilake <[email protected]> 
> wrote:
> Hi Mohan,
> 
> I have attached the latest patch for cep-877(I removed the old patch from the 
> jira). But the very first patch I have attached in CEP-873 is required by 
> this patch.
> 
> I made few improvements to both the algorithms where you can give the 
> attributes you want to count. Initial version I did was to count distinct 
> tuples but practically I think counting distinct attributes is going to be 
> useful. If user doesn't give any attribute I simply count the distinct tuple 
> with all the attributes.
> 
> I have added two test cases but I can see in the build cluster test cases are 
> removed, I did test locally only with my test cases and worked fine.
> 
> If you have any issues with the two patches please let me know.
> 
> 
> OK Lahiru, I'll go through that Today...
> 
> Thanks,
> Mohan
>  
> Thanks
> Lahiru 
> On Jun 27, 2014, at 2:03 AM, Seshika Fernando wrote:
> 
>> Hi Lahiru,
>> 
>> As Srinath has mentioned as well, frequency counting algorithms are very 
>> useful in financial market scenarios as well (especially fraud detection and 
>> surveillance). 
>> Thanks for doing this and I will take a look too.
>> 
>> seshika
>> 
>> 
>> On Fri, Jun 27, 2014 at 10:52 AM, Mohanadarshan Vivekanandalingam 
>> <[email protected]> wrote:
>> 
>> 
>> 
>> On Fri, Jun 27, 2014 at 10:00 AM, Lahiru Gunathilake <[email protected]> 
>> wrote:
>> Hi Mohan,
>> 
>> Hi Lahiru,
>>  
>> 
>> I wrote some samples but I can write more test-cases and provide another 
>> patch. Please feel free to change the naming of the windows as you like. 
>> 
>> Really appreciate your contribution.. Sure, I'll start look into this..
>> 
>> Thanks,
>> Mohan
>> 
>> 
>> Regards
>> Lahiru
>> 
>> On Jun 26, 2014, at 11:35 PM, Srinath Perera wrote:
>> 
>>> Hi All, 
>>> 
>>> Lahiru and myself had a call today morning. 
>>> 
>>> Plan is to
>>> 1) Lahiru to look at hoeffding tree and other classification algorithms and 
>>> select one to implement. He will compare the performance against MOA or 
>>> some other implementation. 
>>> 2) then we will use it for a  Fraud analysis scenario as a proof of its 
>>> validity. 
>>> 
>>> Then we will decide how to continue from that point. 
>>> 
>>> Mohan could you look at the patch? Lahiru will write test cases that you 
>>> can use to verify. 
>>> 
>>> --Srinath
>>> 
>>> 
>>> On Tue, Jun 24, 2014 at 4:56 AM, Lahiru Gunathilake <[email protected]> 
>>> wrote:
>>> Hi Srinath,
>>> 
>>> Thanks for the response. 
>>> On Jun 23, 2014, at 10:48 PM, Srinath Perera wrote:
>>> 
>>>> Hi Lahiru, 
>>>> 
>>>> Sorry for not responding earlier. I was traveling last week. 
>>>> 
>>>> I guess you know frequent item set !=  clustering algorithms. 
>>> +1
>>> 
>>>> 
>>>> Can we have a chat sometime to discuss details about the implementation. 
>>> Yes sure, that would be very useful.
>>> 
>>> Thanks
>>> Lahiru
>>> 
>>>> 
>>>> --Srinath
>>>> 
>>>> 
>>>> On Sun, Jun 22, 2014 at 6:25 PM, Lahiru Gunathilake <[email protected]> 
>>>> wrote:
>>>> Hi All,
>>>> 
>>>> I have implemented another frequency counting algorithm[1] which is a 
>>>> classic algorithm for mining frequent items in a data stream. Basically 
>>>> users can specify a minimum average value of frequent items with an error 
>>>> value. 
>>>> 
>>>> This algorithm will accept two user-specified parameters: a support 
>>>> threshold s [0-1] and an error parameter e [0-1] such that e << s and 
>>>> recommended number for e is normally s/10 or s/20. Let N denote the 
>>>> current length
>>>> of the stream, i.e., the number of tuples seen so far. At any point of 
>>>> time, this algorithm can be asked to produce a list of events along with 
>>>> their estimated frequencies. The answers produced by this algorithm will 
>>>> have the
>>>> following guarantees:
>>>> 1. All item(set)s whose true frequency exceeds sN are outputs.
>>>>  There are no false negatives .
>>>> 2. No events whose true frequency is less than (s - e) N 
>>>> is output.
>>>>  3. Estimated frequencies are less  than the true frequencies
>>>> by at most eN.
>>>>  .
>>>> The incoming stream is conceptually divided in to buckets of width w = 
>>>> ceiling(1/e) transactions each. Buckets are labeled with bucket ids , 
>>>> starting from 1.We denote the current bucket id  by bcurrent whose value 
>>>> is ceiling(N/w). For an element , we denote its true frequency in the 
>>>> stream seen so far by "fe"  . Note that e and w are fixed for a data 
>>>> stream while N, bcurrent and fe are the variables whose value changes when 
>>>> the stream progress.
>>>> Here the data structure ,  is a set of entries are tuples with the form of 
>>>> (event, f, delta), where event is the actual event and "f" is the  is an 
>>>> integer representing its estimated frequency, and delta is the maximum 
>>>> possible error in "fe".
>>>> 
>>>>  In this algorithm every new event will anyways added in to the 
>>>> data-structure optimistically and there is no initial condition to enter 
>>>> in to the data-structure but iteratively if certain events are not match 
>>>> to the condition provided
>>>> by the user those events will be removed when N%windowSize = 0, during 
>>>> this I output those events as expired-events in the window. For the 
>>>> incoming events if event is already exists I output those as 
>>>> current-events and if its a new
>>>> event I just add it to the data-structure and iterate through all the 
>>>> events available and find the matching events based on s and e values and 
>>>> only those events will be output.
>>>> 
>>>> I am not sure based on the window definition this is the correct approach 
>>>> or may be windows aren't the best way to associate this algorithm in to 
>>>> siddhi. I have attached my patch to jira[2] and if you can look that would 
>>>> be great.
>>>>     
>>>> Siddhi Query will looks like below,
>>>> from  cseEventStream#window.lossyFrequent(0.1,0.01) " +
>>>>                                                        "select symbol, 
>>>> price " +
>>>>                                                        "insert into 
>>>> StockQuote;
>>>> 
>>>> 
>>>> [1]Gurmeet Singh Manku and Rajeev Motwani. 2002. Approximate frequency 
>>>> counts over data streams. In Proceedings of the 28th international 
>>>> conference on Very Large Data Bases (VLDB '02). VLDB Endowment 346-357.
>>>> 
>>>> Regards
>>>> Lahiru
>>>> On Jun 19, 2014, at 2:36 AM, Lahiru Gunathilake wrote:
>>>> 
>>>>> Hi All,
>>>>> On Jun 17, 2014, at 1:49 AM, Lahiru Gunathilake wrote:
>>>>> 
>>>>>> Hi All,
>>>>>> 
>>>>>> I am planning to evaluate different event stream clustering algorithms 
>>>>>> as part of my studies(I am a graduate student at indiana University). I 
>>>>>> think Siddhi is a good place to experiment this, As per my understanding 
>>>>>> based on the docs Siddhi doesn't have a stream clustering interface I 
>>>>>> can use directly to plug my own algorithm. So I am thinking of first 
>>>>>> come up an interface for different clustering algorithms and add 
>>>>>> implementation of algorithms for each event stream by invoking an 
>>>>>> operation like SiddhiManager.addQuery. Or I can make the algorithm 
>>>>>> configure as part of query language. If the second option is more 
>>>>>> consistent with current model I can wrap-up the work in that way but 
>>>>>> initially focussing on first approach will be easier for me. So each 
>>>>>> algorithm can be associated to a desired event Stream or can be 
>>>>>> associated globally. If its associated with each stream algorithm will 
>>>>>> run local to each stream otherwise it will run in global context. Based 
>>>>>> on the algorithm I can provide a way to configure it with parameters.
>>>>>> 
>>>>> I am sure I have confused with above implementation details, after 
>>>>> looking in to Siddhi extension points I figured out I just have to 
>>>>> implement a new window type. I have implemented one algorithm to keep the 
>>>>> most frequent events 
>>>>> came in a event stream. So queries can looks like below,
>>>>> 
>>>>> from  cseEventStream#window.frequent(2) " +
>>>>>                                                        "select symbol, 
>>>>> price " +
>>>>>                                                        "insert into 
>>>>> StockQuote;
>>>>> 
>>>>> There are multiple algorithms to keep the most frequent events in a given 
>>>>> window size for now I just implemented a simple algorithm[1] with the 
>>>>> processing complexity of O(1) and space complexity O(n) where n is the 
>>>>> limit of the most frequent items. I have created a patch and attached it 
>>>>> to jira[2].
>>>>> 
>>>>> [1]       Jayadev,        and     David   Gries   Misra,  "Finding        
>>>>> repeated        elements,"      in      Science of      computer        
>>>>> programming     2,      no.     
>>>>> 2 (1982): 143-152.
>>>>> [2]https://wso2.org/jira/browse/CEP-877
>>>>> 
>>>>> Thanks
>>>>> Lahiru
>>>>>> To start this I hope to implement a frequent item set mining algorithm 
>>>>>> which can be used to find out most frequent items of an event stream. 
>>>>>> Search engines use these kind of data to find out most frequent searches 
>>>>>> in a given time window and optimize the search queries. I can start with 
>>>>>> some algorithms like Misra-Gries algorithm[1] and Manku and       
>>>>>> Motwani [2] and then move towards more of data clustering algorithms. 
>>>>>> For the time being I will write the clustering results in to a file and 
>>>>>> later I think I can use more stable storage (either wso2 registry or 
>>>>>> other prefered way in wso2 product stack). If Siddhi or WSO2 CEP already 
>>>>>> have the capability of frequent item mining I will start with a more 
>>>>>> classification type algorithm.
>>>>>> 
>>>>>> Your feedback will be very useful for my work. If you have requirement 
>>>>>> for any specific type of algorithms based on the real client 
>>>>>> interactions you have, I would like to know them and implement them with 
>>>>>> Siddhi and do the comparison.
>>>>>> 
>>>>>> Thanks
>>>>>> Lahiru
>>>>>> _______________________________________________
>>>>>> Architecture mailing list
>>>>>> [email protected]
>>>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>>>> 
>>>>> _______________________________________________
>>>>> Architecture mailing list
>>>>> [email protected]
>>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>>> 
>>>> 
>>>> _______________________________________________
>>>> Architecture mailing list
>>>> [email protected]
>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>>> 
>>>> 
>>>> 
>>>> 
>>>> -- 
>>>> ============================
>>>> Srinath Perera, Ph.D.
>>>>    http://people.apache.org/~hemapani/
>>>>    http://srinathsview.blogspot.com/
>>>> _______________________________________________
>>>> Architecture mailing list
>>>> [email protected]
>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>> 
>>> 
>>> _______________________________________________
>>> Architecture mailing list
>>> [email protected]
>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>> 
>>> 
>>> 
>>> 
>>> -- 
>>> ============================
>>> Srinath Perera, Ph.D.
>>>    http://people.apache.org/~hemapani/
>>>    http://srinathsview.blogspot.com/
>>> _______________________________________________
>>> Architecture mailing list
>>> [email protected]
>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>> 
>> 
>> 
>> 
>> -- 
>> V. Mohanadarshan
>> Software Engineer,
>> Data Technologies Team,
>> WSO2, Inc. http://wso2.com 
>> lean.enterprise.middleware.
>> 
>> email: [email protected] 
>> phone:(+94) 771117673
>> 
>> _______________________________________________
>> Architecture mailing list
>> [email protected]
>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>> 
>> 
> 
> 
> _______________________________________________
> Architecture mailing list
> [email protected]
> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
> 
> 
> 
> 
> -- 
> V. Mohanadarshan
> Software Engineer,
> Data Technologies Team,
> WSO2, Inc. http://wso2.com 
> lean.enterprise.middleware.
> 
> email: [email protected] 
> phone:(+94) 771117673
> 
> 
> 
> -- 
> V. Mohanadarshan
> Software Engineer,
> Data Technologies Team,
> WSO2, Inc. http://wso2.com 
> lean.enterprise.middleware.
> 
> email: [email protected] 
> phone:(+94) 771117673
> _______________________________________________
> Architecture mailing list
> [email protected]
> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture

_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture

Re: [Architecture] Using Siddhi Event processor to implement/evaluate some clustering algorithms

Reply via email to