Re: [Architecture] Using Siddhi Event processor to implement/evaluate some clustering algorithms

Mohanadarshan Vivekanandalingam Sun, 29 Jun 2014 21:58:27 -0700

On Sat, Jun 28, 2014 at 2:22 PM, Lahiru Gunathilake <[email protected]>
wrote:


> Hi Mohan,
>
> I have attached the latest patch for CEP-877 (I removed the old patch from
> the JIRA). Note that this patch requires the very first patch I attached
> to CEP-873.
>
> I made a few improvements to both algorithms so that you can specify the
> attributes you want to count. The initial version counted distinct
> tuples, but in practice I think counting distinct attributes is going to
> be useful. If the user doesn't give any attributes, I simply count
> distinct tuples using all the attributes.
>
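One way to picture the attribute selection described above, as a small Python sketch. The helper name count_key and the representation of an event as a dict are assumptions made for illustration; this is not the code in the patch.

```python
from collections import Counter


def count_key(event, attributes=None):
    """Return the hashable key under which an event is counted.

    `event` is assumed to be a dict of attribute name -> value.
    """
    if attributes:
        # Count only the attributes the user asked for.
        return tuple(event[name] for name in attributes)
    # No attributes given: fall back to the whole tuple.
    return tuple(sorted(event.items()))


def count_events(events, attributes=None):
    """Count events by the selected attributes (or by the whole tuple)."""
    return Counter(count_key(event, attributes) for event in events)
```

Counting on ["symbol"] would treat two IBM events with different prices as the same item, while counting with no attributes would keep them apart.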
> I have added two test cases, but I can see that test cases are removed in
> the build cluster, so I tested only locally with my own test cases, which
> worked fine.
>
> If you have any issues with the two patches please let me know.
>
>
OK Lahiru, I'll go through that today.

Thanks,
Mohan


> Thanks
> Lahiru
> On Jun 27, 2014, at 2:03 AM, Seshika Fernando wrote:
>
> Hi Lahiru,
>
> As Srinath has mentioned, frequency counting algorithms are also very
> useful in financial market scenarios (especially fraud detection and
> surveillance).
> Thanks for doing this and I will take a look too.
>
> seshika
>
>
> On Fri, Jun 27, 2014 at 10:52 AM, Mohanadarshan Vivekanandalingam <
> [email protected]> wrote:
>
>>
>>
>>
>> On Fri, Jun 27, 2014 at 10:00 AM, Lahiru Gunathilake <
>> [email protected]> wrote:
>>
>>> Hi Mohan,
>>>
>>
>> Hi Lahiru,
>>
>>
>>>
>>> I wrote some samples but I can write more test-cases and provide another
>>> patch. Please feel free to change the naming of the windows as you like.
>>>
>>
>> Really appreciate your contribution. Sure, I'll start looking into this.
>>
>> Thanks,
>> Mohan
>>
>>
>>> Regards
>>> Lahiru
>>>
>>> On Jun 26, 2014, at 11:35 PM, Srinath Perera wrote:
>>>
>>> Hi All,
>>>
>>> Lahiru and I had a call this morning.
>>>
>>> The plan is:
>>> 1) Lahiru will look at the Hoeffding tree and other classification
>>> algorithms and select one to implement. He will compare its performance
>>> against MOA or some other implementation.
>>> 2) Then we will use it for a fraud analysis scenario as a proof of its
>>> validity.
>>>
>>> Then we will decide how to continue from that point.
>>>
>>> Mohan could you look at the patch? Lahiru will write test cases that you
>>> can use to verify.
>>>
>>> --Srinath
>>>
>>>
>>> On Tue, Jun 24, 2014 at 4:56 AM, Lahiru Gunathilake <
>>> [email protected]> wrote:
>>>
>>>> Hi Srinath,
>>>>
>>>> Thanks for the response.
>>>> On Jun 23, 2014, at 10:48 PM, Srinath Perera wrote:
>>>>
>>>> Hi Lahiru,
>>>>
>>>> Sorry for not responding earlier. I was traveling last week.
>>>>
>>>> I guess you know that frequent item set mining != clustering algorithms.
>>>>
>>>> +1
>>>>
>>>>
>>>> Can we have a chat sometime to discuss details about the
>>>> implementation?
>>>>
>>>> Yes sure, that would be very useful.
>>>>
>>>> Thanks
>>>> Lahiru
>>>>
>>>>
>>>> --Srinath
>>>>
>>>>
>>>> On Sun, Jun 22, 2014 at 6:25 PM, Lahiru Gunathilake <
>>>> [email protected]> wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> I have implemented another frequency counting algorithm[1], which is a
>>>>> classic algorithm for mining frequent items in a data stream. Basically,
>>>>> users specify a minimum support threshold for frequent items together
>>>>> with an error value.
>>>>>
>>>>> This algorithm accepts two user-specified parameters: a support
>>>>> threshold s (between 0 and 1) and an error parameter e (between 0 and
>>>>> 1) such that e << s; the recommended value for e is normally s/10 or
>>>>> s/20. Let N denote the current length of the stream, i.e., the number
>>>>> of tuples seen so far. At any point in time, the algorithm can be asked
>>>>> to produce a list of events along with their estimated frequencies. The
>>>>> answers produced by this algorithm have the following guarantees:
>>>>> 1. All item(set)s whose true frequency exceeds sN are output. There
>>>>>    are no false negatives.
>>>>> 2. No event whose true frequency is less than (s - e)N is output.
>>>>> 3. Estimated frequencies are less than the true frequencies by at
>>>>>    most eN.
>>>>>
>>>>> The incoming stream is conceptually divided into buckets of width
>>>>> w = ceiling(1/e) transactions each. Buckets are labeled with bucket
>>>>> ids, starting from 1. We denote the current bucket id by bcurrent,
>>>>> whose value is ceiling(N/w). For an event, we denote its true
>>>>> frequency in the stream seen so far by "fe". Note that e and w are
>>>>> fixed for a data stream, while N, bcurrent and fe are variables whose
>>>>> values change as the stream progresses.
>>>>>
>>>>> The data structure is a set of entries of the form (event, f, delta),
>>>>> where event is the actual event, "f" is an integer representing its
>>>>> estimated frequency, and delta is the maximum possible error in "f".
>>>>>
>>>>> In this algorithm every new event is added to the data structure
>>>>> optimistically; there is no initial condition for entering the data
>>>>> structure. Events that do not meet the condition provided by the user
>>>>> are removed iteratively whenever N % windowSize = 0, and I output those
>>>>> events as expired-events of the window. For an incoming event, if the
>>>>> event already exists I output it as a current-event; if it is a new
>>>>> event I just add it to the data structure, iterate through all the
>>>>> events available, find the matching events based on the s and e values,
>>>>> and output only those events.
>>>>>
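A minimal sketch of the lossy counting scheme described above, following the Manku and Motwani paper[1] rather than the attached patch. It is written in Python for brevity; the class name LossyCounter and its methods are hypothetical and are not part of Siddhi. It assumes that events are hashable, that an entry is pruned at a bucket boundary when f + delta <= bcurrent, and that an event is reported when f >= (s - e)N.

```python
import math


class LossyCounter:
    """Lossy counting sketch: s is the support threshold, e the error."""

    def __init__(self, s, e):
        if not 0 < e < s < 1:
            raise ValueError("expected 0 < e < s < 1")
        self.s = s
        self.e = e
        self.w = math.ceil(1 / e)  # bucket width w = ceiling(1/e)
        self.n = 0                 # N: number of events seen so far
        self.entries = {}          # event -> [f, delta]

    def add(self, event):
        """Record one event; return the events pruned (expired) by it."""
        self.n += 1
        b_current = math.ceil(self.n / self.w)
        if event in self.entries:
            self.entries[event][0] += 1
        else:
            # A new event is always admitted; delta bounds the count
            # it may have had before it was (re)inserted.
            self.entries[event] = [1, b_current - 1]
        expired = []
        if self.n % self.w == 0:   # bucket boundary: prune
            for ev, (f, delta) in list(self.entries.items()):
                if f + delta <= b_current:
                    del self.entries[ev]
                    expired.append(ev)
        return expired

    def frequent(self):
        """Return {event: f} for entries with f >= (s - e) * N."""
        threshold = (self.s - self.e) * self.n
        return {ev: f for ev, (f, _) in self.entries.items()
                if f >= threshold}
```

With s = 0.1 and e = 0.01, the bucket width would be w = ceiling(1/0.01) = 100, so pruning would run once every 100 events.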
>>>>> I am not sure whether this is the correct approach given the window
>>>>> definition, or maybe windows aren't the best way to integrate this
>>>>> algorithm into Siddhi. I have attached my patch to jira[2], and it
>>>>> would be great if you could take a look.
>>>>>
>>>>> The Siddhi query will look like this:
>>>>>
>>>>> from cseEventStream#window.lossyFrequent(0.1, 0.01)
>>>>> select symbol, price
>>>>> insert into StockQuote;
>>>>>
>>>>>
>>>>> [1] Gurmeet Singh Manku and Rajeev Motwani. 2002. Approximate
>>>>> frequency counts over data streams. In Proceedings of the 28th
>>>>> International Conference on Very Large Data Bases (VLDB '02). VLDB
>>>>> Endowment, 346-357.
>>>>>
>>>>> Regards
>>>>> Lahiru
>>>>> On Jun 19, 2014, at 2:36 AM, Lahiru Gunathilake wrote:
>>>>>
>>>>> Hi All,
>>>>> On Jun 17, 2014, at 1:49 AM, Lahiru Gunathilake wrote:
>>>>>
>>>>> Hi All,
>>>>>
>>>>> I am planning to evaluate different event stream clustering algorithms
>>>>> as part of my studies (I am a graduate student at Indiana University). I
>>>>> think Siddhi is a good place to experiment with this. As per my
>>>>> understanding of the docs, Siddhi doesn't have a stream clustering
>>>>> interface that I can use directly to plug in my own algorithm. So I am
>>>>> thinking of first coming up with an interface for different clustering
>>>>> algorithms and adding implementations of algorithms for each event
>>>>> stream by invoking an operation like SiddhiManager.addQuery.
>>>>> Alternatively, I can make the algorithm configurable as part of the
>>>>> query language. If the second option is more consistent with the current
>>>>> model I can wrap up the work that way, but initially focusing on the
>>>>> first approach will be easier for me. Each algorithm can be associated
>>>>> with a desired event stream or associated globally. If it is associated
>>>>> with a stream, the algorithm will run local to that stream; otherwise it
>>>>> will run in a global context. Depending on the algorithm, I can provide
>>>>> a way to configure it with parameters.
>>>>>
>>>>> I am sure I was confused about the implementation details above. After
>>>>> looking into the Siddhi extension points, I figured out that I just have
>>>>> to implement a new window type. I have implemented one algorithm that
>>>>> keeps the most frequent events arriving in an event stream. Queries can
>>>>> look like this:
>>>>>
>>>>> from cseEventStream#window.frequent(2)
>>>>> select symbol, price
>>>>> insert into StockQuote;
>>>>>
>>>>> There are multiple algorithms for keeping the most frequent events in a
>>>>> given window size; for now I have implemented a simple algorithm[1] with
>>>>> a processing complexity of O(1) and a space complexity of O(n), where n
>>>>> is the limit on the number of most frequent items. I have created a
>>>>> patch and attached it to jira[2].
>>>>>
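One common formulation of the Misra-Gries algorithm[1], sketched in Python under the assumption that the argument of frequent(2) maps to the number of counters k. The function name misra_gries is hypothetical, and this is not the code in the patch.

```python
def misra_gries(stream, k):
    """Return the surviving counters after one pass with at most k counters."""
    counters = {}
    for item in stream:
        if item in counters:
            # Known item: bump its counter.
            counters[item] += 1
        elif len(counters) < k:
            # Free slot: start tracking the new item.
            counters[item] = 1
        else:
            # No free slot: decrement every counter and drop the ones
            # that reach zero; the new item itself is not stored.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

In this simple version the decrement step touches every counter, so a single event can cost O(k); the cost is O(1) only in the amortized sense, since each decrement undoes an earlier increment. With k counters, any item that occurs more than N/(k+1) times in a stream of N events is guaranteed to still hold a counter at the end, although its stored count may underestimate the true frequency.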
>>>>> [1] Jayadev Misra and David Gries, "Finding repeated elements,"
>>>>> Science of Computer Programming 2, no. 2 (1982): 143-152.
>>>>> [2]https://wso2.org/jira/browse/CEP-877
>>>>>
>>>>>
>>>>> Thanks
>>>>> Lahiru
>>>>>
>>>>> To start with, I hope to implement a frequent item set mining algorithm
>>>>> which can be used to find the most frequent items of an event stream.
>>>>> Search engines use this kind of data to find the most frequent searches
>>>>> in a given time window and optimize the search queries. I can start with
>>>>> algorithms like the Misra-Gries algorithm[1] and Manku and Motwani[2],
>>>>> and then move towards more data clustering algorithms. For the time
>>>>> being I will write the clustering results to a file, and later I think I
>>>>> can use more stable storage (either the WSO2 registry or another
>>>>> preferred way in the WSO2 product stack). If Siddhi or WSO2 CEP already
>>>>> has the capability of frequent item mining, I will start with a more
>>>>> classification-type algorithm.
>>>>>
>>>>> Your feedback will be very useful for my work. If you have requirements
>>>>> for any specific types of algorithms based on your real client
>>>>> interactions, I would like to know them, implement them with Siddhi and
>>>>> do the comparison.
>>>>>
>>>>> Thanks
>>>>> Lahiru
>>>>> _______________________________________________
>>>>> Architecture mailing list
>>>>> [email protected]
>>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> ============================
>>>> Srinath Perera, Ph.D.
>>>>    http://people.apache.org/~hemapani/
>>>>    http://srinathsview.blogspot.com/
>>>
>>>
>>> --
>>> ============================
>>> Srinath Perera, Ph.D.
>>>    http://people.apache.org/~hemapani/
>>>    http://srinathsview.blogspot.com/
>>
>>
>> --
>> *V. Mohanadarshan*
>> *Software Engineer,*
>> *Data Technologies Team,*
>> *WSO2, Inc. http://wso2.com <http://wso2.com/> *
>> *lean.enterprise.middleware.*
>>
>> email: [email protected]
>> phone:(+94) 771117673
>>
>
>


-- 
*V. Mohanadarshan*
*Software Engineer,*
*Data Technologies Team,*
*WSO2, Inc. http://wso2.com <http://wso2.com> *
*lean.enterprise.middleware.*

email: [email protected]
phone:(+94) 771117673

