Re: [Architecture] Using Siddhi Event processor to implement/evaluate some clustering algorithms

Mohanadarshan Vivekanandalingam Mon, 30 Jun 2014 04:32:32 -0700

Hi Lahiru,

I have gone through above patches and play around with it little bit..
CEP873.patch seems fine and i have done some basic testing  But i have got
some issues when applying CEP877.patch. Even-though i am able to get rid
off that after making some changes, I think it is better if you send proper
patch for that.


Will commit these improvements once i got the patch for CEP877.

Thanks,
Mohan



On Mon, Jun 30, 2014 at 10:27 AM, Mohanadarshan Vivekanandalingam <
[email protected]> wrote:

>
>
> On Sat, Jun 28, 2014 at 2:22 PM, Lahiru Gunathilake <[email protected]>
> wrote:
>
>> Hi Mohan,
>>
>> I have attached the latest patch for cep-877(I removed the old patch from
>> the jira). But the very first patch I have attached in CEP-873 is required
>> by this patch.
>>
>> I made few improvements to both the algorithms where you can give the
>> attributes you want to count. Initial version I did was to count distinct
>> tuples but practically I think counting distinct attributes is going to be
>> useful. If user doesn't give any attribute I simply count the distinct
>> tuple with all the attributes.
>>
>> I have added two test cases but I can see in the build cluster test cases
>> are removed, I did test locally only with my test cases and worked fine.
>>
>> If you have any issues with the two patches please let me know.
>>
>>
> OK Lahiru, I'll go through that Today...
>
> Thanks,
> Mohan
>
>
>> Thanks
>> Lahiru
>> On Jun 27, 2014, at 2:03 AM, Seshika Fernando wrote:
>>
>> Hi Lahiru,
>>
>> As Srinath has mentioned as well, frequency counting algorithms are very
>> useful in financial market scenarios as well (especially fraud detection
>> and surveillance).
>> Thanks for doing this and I will take a look too.
>>
>> seshika
>>
>>
>> On Fri, Jun 27, 2014 at 10:52 AM, Mohanadarshan Vivekanandalingam <
>> [email protected]> wrote:
>>
>>>
>>>
>>>
>>> On Fri, Jun 27, 2014 at 10:00 AM, Lahiru Gunathilake <
>>> [email protected]> wrote:
>>>
>>>> Hi Mohan,
>>>>
>>>
>>> Hi Lahiru,
>>>
>>>
>>>>
>>>> I wrote some samples but I can write more test-cases and provide
>>>> another patch. Please feel free to change the naming of the windows as you
>>>> like.
>>>>
>>>
>>> Really appreciate your contribution.. Sure, I'll start look into this..
>>>
>>> Thanks,
>>> Mohan
>>>
>>>
>>>> Regards
>>>> Lahiru
>>>>
>>>> On Jun 26, 2014, at 11:35 PM, Srinath Perera wrote:
>>>>
>>>> Hi All,
>>>>
>>>> Lahiru and myself had a call today morning.
>>>>
>>>> Plan is to
>>>> 1) Lahiru to look at hoeffding tree and other classification algorithms
>>>> and select one to implement. He will compare the performance against MOA or
>>>> some other implementation.
>>>> 2) then we will use it for a  Fraud analysis scenario as a proof of its
>>>> validity.
>>>>
>>>> Then we will decide how to continue from that point.
>>>>
>>>> Mohan could you look at the patch? Lahiru will write test cases that
>>>> you can use to verify.
>>>>
>>>> --Srinath
>>>>
>>>>
>>>> On Tue, Jun 24, 2014 at 4:56 AM, Lahiru Gunathilake <
>>>> [email protected]> wrote:
>>>>
>>>>> Hi Srinath,
>>>>>
>>>>> Thanks for the response.
>>>>> On Jun 23, 2014, at 10:48 PM, Srinath Perera wrote:
>>>>>
>>>>> Hi Lahiru,
>>>>>
>>>>> Sorry for not responding earlier. I was traveling last week.
>>>>>
>>>>> I guess you know frequent item set !=  clustering algorithms.
>>>>>
>>>>> +1
>>>>>
>>>>>
>>>>> Can we have a chat sometime to discuss details about the
>>>>> implementation.
>>>>>
>>>>> Yes sure, that would be very useful.
>>>>>
>>>>> Thanks
>>>>> Lahiru
>>>>>
>>>>>
>>>>> --Srinath
>>>>>
>>>>>
>>>>> On Sun, Jun 22, 2014 at 6:25 PM, Lahiru Gunathilake <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> I have implemented another frequency counting algorithm[1] which is a
>>>>>> classic algorithm for mining frequent items in a data stream. Basically
>>>>>> users can specify a minimum average value of frequent items with an error
>>>>>> value.
>>>>>>
>>>>>> This algorithm will accept two user-specified parameters: a support
>>>>>> threshold s [0-1] and an error parameter e [0-1] such that e << s
>>>>>> and recommended number for e is normally s/10 or s/20. Let N denote
>>>>>> the current length
>>>>>> of the stream, i.e., the number of tuples seen so far. At any point
>>>>>> of time, this algorithm can be asked to produce a list of events
>>>>>> along with their estimated frequencies. The answers produced by this
>>>>>> algorithm will have the
>>>>>> following guarantees:
>>>>>> 1. All item(set)s whose true frequency exceeds sN are outputs.
>>>>>>  There are no false negatives .
>>>>>> 2. No events whose true frequency is less than (s - e) N
>>>>>> is output.
>>>>>>  3. Estimated frequencies are less  than the true frequencies
>>>>>> by at most eN.
>>>>>>  .
>>>>>> The incoming stream is conceptually divided in to buckets of width w
>>>>>> = ceiling(1/e) transactions each. Buckets are labeled with bucket
>>>>>> ids , starting from 1.We denote the current bucket id  by bcurrent
>>>>>> whose value is ceiling(N/w). For an element , we denote its true
>>>>>> frequency in the stream seen so far by "fe"  . Note that e and w are
>>>>>> fixed for a data stream while N, bcurrent and fe are the variables whose
>>>>>> value changes when the stream progress.
>>>>>> Here the data structure ,  is a set of entries are tuples with the
>>>>>> form of (event, f, delta), where event is the actual event and "f" is
>>>>>> the  is an integer representing its estimated frequency, and delta
>>>>>> is the maximum possible error in "fe".
>>>>>>
>>>>>>  In this algorithm every new event will anyways added in to the
>>>>>> data-structure optimistically and there is no initial condition to enter 
>>>>>> in
>>>>>> to the data-structure but iteratively if certain events are not match to
>>>>>> the condition provided
>>>>>> by the user those events will be removed when N%windowSize = 0,
>>>>>> during this I output those events as expired-events in the window. For 
>>>>>> the
>>>>>> incoming events if event is already exists I output those as 
>>>>>> current-events
>>>>>> and if its a new
>>>>>> event I just add it to the data-structure and iterate through all the
>>>>>> events available and find the matching events based on s and e values and
>>>>>> only those events will be output.
>>>>>>
>>>>>> I am not sure based on the window definition this is the correct
>>>>>> approach or may be windows aren't the best way to associate this 
>>>>>> algorithm
>>>>>> in to siddhi. I have attached my patch to jira[2] and if you can look 
>>>>>> that
>>>>>> would be great.
>>>>>>
>>>>>> Siddhi Query will looks like below,
>>>>>> from  cseEventStream#window.lossyFrequent(0.1,0.01) " +
>>>>>>                                                        "select
>>>>>> symbol, price " +
>>>>>>                                                        "insert into
>>>>>> StockQuote;
>>>>>>
>>>>>>
>>>>>> [1]Gurmeet Singh Manku and Rajeev Motwani. 2002. Approximate
>>>>>> frequency counts over data streams. In Proceedings of the 28th
>>>>>> international conference on Very Large Data Bases (VLDB '02). VLDB
>>>>>> Endowment 346-357.
>>>>>>
>>>>>> Regards
>>>>>> Lahiru
>>>>>> On Jun 19, 2014, at 2:36 AM, Lahiru Gunathilake wrote:
>>>>>>
>>>>>> Hi All,
>>>>>> On Jun 17, 2014, at 1:49 AM, Lahiru Gunathilake wrote:
>>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> I am planning to evaluate different event stream clustering
>>>>>> algorithms as part of my studies(I am a graduate student at indiana
>>>>>> University). I think Siddhi is a good place to experiment this, As per my
>>>>>> understanding based on the docs Siddhi doesn't have a stream clustering
>>>>>> interface I can use directly to plug my own algorithm. So I am thinking 
>>>>>> of
>>>>>> first come up an interface for different clustering algorithms and add
>>>>>> implementation of algorithms for each event stream by invoking an 
>>>>>> operation
>>>>>> like SiddhiManager.addQuery. Or I can make the algorithm configure as 
>>>>>> part
>>>>>> of query language. If the second option is more consistent with current
>>>>>> model I can wrap-up the work in that way but initially focussing on first
>>>>>> approach will be easier for me. So each algorithm can be associated to a
>>>>>> desired event Stream or can be associated globally. If its associated 
>>>>>> with
>>>>>> each stream algorithm will run local to each stream otherwise it will run
>>>>>> in global context. Based on the algorithm I can provide a way to 
>>>>>> configure
>>>>>> it with parameters.
>>>>>>
>>>>>> I am sure I have confused with above implementation details, after
>>>>>> looking in to Siddhi extension points I figured out I just have to
>>>>>> implement a new window type. I have implemented one algorithm to keep the
>>>>>> most frequent events
>>>>>> came in a event stream. So queries can looks like below,
>>>>>>
>>>>>> from  cseEventStream#window.frequent(2) " +
>>>>>>                                                        "select
>>>>>> symbol, price " +
>>>>>>                                                        "insert into
>>>>>> StockQuote;
>>>>>>
>>>>>> There are multiple algorithms to keep the most frequent events in a
>>>>>> given window size for now I just implemented a simple algorithm[1] with 
>>>>>> the
>>>>>> processing complexity of O(1) and space complexity O(n) where n is the
>>>>>> limit of the most frequent items. I have created a patch and attached it 
>>>>>> to
>>>>>> jira[2].
>>>>>>
>>>>>> [1] Jayadev, and David Gries Misra, "Finding repeated elements," in
>>>>>> Science of computer programming 2, no.
>>>>>> 2 (1982): 143-152.
>>>>>> [2]https://wso2.org/jira/browse/CEP-877
>>>>>>
>>>>>>
>>>>>> Thanks
>>>>>> Lahiru
>>>>>>
>>>>>> To start this I hope to implement a frequent item set mining
>>>>>> algorithm which can be used to find out most frequent items of an event
>>>>>> stream. Search engines use these kind of data to find out most frequent
>>>>>> searches in a given time window and optimize the search queries. I can
>>>>>> start with some algorithms like Misra-Gries algorithm[1] and Manku
>>>>>> and Motwani [2] and then move towards more of data clustering
>>>>>> algorithms. For the time being I will write the clustering results in to 
>>>>>> a
>>>>>> file and later I think I can use more stable storage (either wso2 
>>>>>> registry
>>>>>> or other prefered way in wso2 product stack). If Siddhi or WSO2 CEP 
>>>>>> already
>>>>>> have the capability of frequent item mining I will start with a more
>>>>>> classification type algorithm.
>>>>>>
>>>>>> Your feedback will be very useful for my work. If you have
>>>>>> requirement for any specific type of algorithms based on the real client
>>>>>> interactions you have, I would like to know them and implement them with
>>>>>> Siddhi and do the comparison.
>>>>>>
>>>>>> Thanks
>>>>>> Lahiru
>>>>>> _______________________________________________
>>>>>> Architecture mailing list
>>>>>> [email protected]
>>>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> Architecture mailing list
>>>>>> [email protected]
>>>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>>>>>
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> Architecture mailing list
>>>>>> [email protected]
>>>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> ============================
>>>>> Srinath Perera, Ph.D.
>>>>>    http://people.apache.org/~hemapani/
>>>>>    http://srinathsview.blogspot.com/
>>>>>  _______________________________________________
>>>>> Architecture mailing list
>>>>> [email protected]
>>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Architecture mailing list
>>>>> [email protected]
>>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> ============================
>>>> Srinath Perera, Ph.D.
>>>>    http://people.apache.org/~hemapani/
>>>>    http://srinathsview.blogspot.com/
>>>>  _______________________________________________
>>>> Architecture mailing list
>>>> [email protected]
>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> *V. Mohanadarshan*
>>> *Software Engineer,*
>>> *Data Technologies Team,*
>>> *WSO2, Inc. http://wso2.com <http://wso2.com/> *
>>> *lean.enterprise.middleware.*
>>>
>>> email: [email protected]
>>> phone:(+94) 771117673
>>>
>>> _______________________________________________
>>> Architecture mailing list
>>> [email protected]
>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>>
>>>
>>
>>
>> _______________________________________________
>> Architecture mailing list
>> [email protected]
>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>
>>
>
>
> --
> *V. Mohanadarshan*
> *Software Engineer,*
> *Data Technologies Team,*
> *WSO2, Inc. http://wso2.com <http://wso2.com> *
> *lean.enterprise.middleware.*
>
> email: [email protected]
> phone:(+94) 771117673
>



-- 
*V. Mohanadarshan*
*Software Engineer,*
*Data Technologies Team,*
*WSO2, Inc. http://wso2.com <http://wso2.com> *
*lean.enterprise.middleware.*

email: [email protected]
phone:(+94) 771117673

_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture

Re: [Architecture] Using Siddhi Event processor to implement/evaluate some clustering algorithms

Reply via email to