Re: [Architecture] Using Siddhi Event processor to implement/evaluate some clustering algorithms

Seshika Fernando Thu, 26 Jun 2014 23:04:04 -0700

Hi Lahiru,

As Srinath has mentioned as well, frequency counting algorithms are very
useful in financial market scenarios as well (especially fraud detection
and surveillance).
Thanks for doing this and I will take a look too.


seshika


On Fri, Jun 27, 2014 at 10:52 AM, Mohanadarshan Vivekanandalingam <
[email protected]> wrote:

>
>
>
> On Fri, Jun 27, 2014 at 10:00 AM, Lahiru Gunathilake <[email protected]
> > wrote:
>
>> Hi Mohan,
>>
>
> Hi Lahiru,
>
>
>>
>> I wrote some samples but I can write more test-cases and provide another
>> patch. Please feel free to change the naming of the windows as you like.
>>
>
> Really appreciate your contribution.. Sure, I'll start look into this..
>
> Thanks,
> Mohan
>
>
>> Regards
>> Lahiru
>>
>> On Jun 26, 2014, at 11:35 PM, Srinath Perera wrote:
>>
>> Hi All,
>>
>> Lahiru and myself had a call today morning.
>>
>> Plan is to
>> 1) Lahiru to look at hoeffding tree and other classification algorithms
>> and select one to implement. He will compare the performance against MOA or
>> some other implementation.
>> 2) then we will use it for a  Fraud analysis scenario as a proof of its
>> validity.
>>
>> Then we will decide how to continue from that point.
>>
>> Mohan could you look at the patch? Lahiru will write test cases that you
>> can use to verify.
>>
>> --Srinath
>>
>>
>> On Tue, Jun 24, 2014 at 4:56 AM, Lahiru Gunathilake <[email protected]
>> > wrote:
>>
>>> Hi Srinath,
>>>
>>> Thanks for the response.
>>> On Jun 23, 2014, at 10:48 PM, Srinath Perera wrote:
>>>
>>> Hi Lahiru,
>>>
>>> Sorry for not responding earlier. I was traveling last week.
>>>
>>> I guess you know frequent item set !=  clustering algorithms.
>>>
>>> +1
>>>
>>>
>>> Can we have a chat sometime to discuss details about the implementation.
>>>
>>> Yes sure, that would be very useful.
>>>
>>> Thanks
>>> Lahiru
>>>
>>>
>>> --Srinath
>>>
>>>
>>> On Sun, Jun 22, 2014 at 6:25 PM, Lahiru Gunathilake <
>>> [email protected]> wrote:
>>>
>>>> Hi All,
>>>>
>>>> I have implemented another frequency counting algorithm[1] which is a
>>>> classic algorithm for mining frequent items in a data stream. Basically
>>>> users can specify a minimum average value of frequent items with an error
>>>> value.
>>>>
>>>> This algorithm will accept two user-specified parameters: a support
>>>> threshold s [0-1] and an error parameter e [0-1] such that e << s and
>>>> recommended number for e is normally s/10 or s/20. Let N denote the
>>>> current length
>>>> of the stream, i.e., the number of tuples seen so far. At any point of
>>>> time, this algorithm can be asked to produce a list of events along
>>>> with their estimated frequencies. The answers produced by this
>>>> algorithm will have the
>>>> following guarantees:
>>>> 1. All item(set)s whose true frequency exceeds sN are outputs.
>>>>  There are no false negatives .
>>>> 2. No events whose true frequency is less than (s - e) N
>>>> is output.
>>>>  3. Estimated frequencies are less  than the true frequencies
>>>> by at most eN.
>>>>  .
>>>> The incoming stream is conceptually divided in to buckets of width w =
>>>> ceiling(1/e) transactions each. Buckets are labeled with bucket ids ,
>>>> starting from 1.We denote the current bucket id  by bcurrent whose
>>>> value is ceiling(N/w). For an element , we denote its true frequency
>>>> in the stream seen so far by "fe"  . Note that e and w are fixed for a
>>>> data stream while N, bcurrent and fe are the variables whose value changes
>>>> when the stream progress.
>>>> Here the data structure ,  is a set of entries are tuples with the
>>>> form of (event, f, delta), where event is the actual event and "f" is
>>>> the  is an integer representing its estimated frequency, and delta is
>>>> the maximum possible error in "fe".
>>>>
>>>>  In this algorithm every new event will anyways added in to the
>>>> data-structure optimistically and there is no initial condition to enter in
>>>> to the data-structure but iteratively if certain events are not match to
>>>> the condition provided
>>>> by the user those events will be removed when N%windowSize = 0, during
>>>> this I output those events as expired-events in the window. For the
>>>> incoming events if event is already exists I output those as current-events
>>>> and if its a new
>>>> event I just add it to the data-structure and iterate through all the
>>>> events available and find the matching events based on s and e values and
>>>> only those events will be output.
>>>>
>>>> I am not sure based on the window definition this is the correct
>>>> approach or may be windows aren't the best way to associate this algorithm
>>>> in to siddhi. I have attached my patch to jira[2] and if you can look that
>>>> would be great.
>>>>
>>>> Siddhi Query will looks like below,
>>>> from  cseEventStream#window.lossyFrequent(0.1,0.01) " +
>>>>                                                        "select symbol,
>>>> price " +
>>>>                                                        "insert into
>>>> StockQuote;
>>>>
>>>>
>>>> [1]Gurmeet Singh Manku and Rajeev Motwani. 2002. Approximate frequency
>>>> counts over data streams. In Proceedings of the 28th international
>>>> conference on Very Large Data Bases (VLDB '02). VLDB Endowment 346-357.
>>>>
>>>> Regards
>>>> Lahiru
>>>> On Jun 19, 2014, at 2:36 AM, Lahiru Gunathilake wrote:
>>>>
>>>> Hi All,
>>>> On Jun 17, 2014, at 1:49 AM, Lahiru Gunathilake wrote:
>>>>
>>>> Hi All,
>>>>
>>>> I am planning to evaluate different event stream clustering algorithms
>>>> as part of my studies(I am a graduate student at indiana University). I
>>>> think Siddhi is a good place to experiment this, As per my understanding
>>>> based on the docs Siddhi doesn't have a stream clustering interface I can
>>>> use directly to plug my own algorithm. So I am thinking of first come up an
>>>> interface for different clustering algorithms and add implementation of
>>>> algorithms for each event stream by invoking an operation like
>>>> SiddhiManager.addQuery. Or I can make the algorithm configure as part of
>>>> query language. If the second option is more consistent with current model
>>>> I can wrap-up the work in that way but initially focussing on first
>>>> approach will be easier for me. So each algorithm can be associated to a
>>>> desired event Stream or can be associated globally. If its associated with
>>>> each stream algorithm will run local to each stream otherwise it will run
>>>> in global context. Based on the algorithm I can provide a way to configure
>>>> it with parameters.
>>>>
>>>> I am sure I have confused with above implementation details, after
>>>> looking in to Siddhi extension points I figured out I just have to
>>>> implement a new window type. I have implemented one algorithm to keep the
>>>> most frequent events
>>>> came in a event stream. So queries can looks like below,
>>>>
>>>> from  cseEventStream#window.frequent(2) " +
>>>>                                                        "select symbol,
>>>> price " +
>>>>                                                        "insert into
>>>> StockQuote;
>>>>
>>>> There are multiple algorithms to keep the most frequent events in a
>>>> given window size for now I just implemented a simple algorithm[1] with the
>>>> processing complexity of O(1) and space complexity O(n) where n is the
>>>> limit of the most frequent items. I have created a patch and attached it to
>>>> jira[2].
>>>>
>>>> [1] Jayadev, and David Gries Misra, "Finding repeated elements," in
>>>> Science of computer programming 2, no.
>>>> 2 (1982): 143-152.
>>>> [2]https://wso2.org/jira/browse/CEP-877
>>>>
>>>>
>>>> Thanks
>>>> Lahiru
>>>>
>>>> To start this I hope to implement a frequent item set mining algorithm
>>>> which can be used to find out most frequent items of an event stream.
>>>> Search engines use these kind of data to find out most frequent searches in
>>>> a given time window and optimize the search queries. I can start with some
>>>> algorithms like Misra-Gries algorithm[1] and Manku and Motwani [2] and
>>>> then move towards more of data clustering algorithms. For the time being I
>>>> will write the clustering results in to a file and later I think I can use
>>>> more stable storage (either wso2 registry or other prefered way in wso2
>>>> product stack). If Siddhi or WSO2 CEP already have the capability of
>>>> frequent item mining I will start with a more classification type 
>>>> algorithm.
>>>>
>>>> Your feedback will be very useful for my work. If you have requirement
>>>> for any specific type of algorithms based on the real client interactions
>>>> you have, I would like to know them and implement them with Siddhi and do
>>>> the comparison.
>>>>
>>>> Thanks
>>>> Lahiru
>>>> _______________________________________________
>>>> Architecture mailing list
>>>> [email protected]
>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>>>
>>>>
>>>> _______________________________________________
>>>> Architecture mailing list
>>>> [email protected]
>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> Architecture mailing list
>>>> [email protected]
>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>>>
>>>>
>>>
>>>
>>> --
>>> ============================
>>> Srinath Perera, Ph.D.
>>>    http://people.apache.org/~hemapani/
>>>    http://srinathsview.blogspot.com/
>>>  _______________________________________________
>>> Architecture mailing list
>>> [email protected]
>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>>
>>>
>>>
>>> _______________________________________________
>>> Architecture mailing list
>>> [email protected]
>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>>
>>>
>>
>>
>> --
>> ============================
>> Srinath Perera, Ph.D.
>>    http://people.apache.org/~hemapani/
>>    http://srinathsview.blogspot.com/
>>  _______________________________________________
>> Architecture mailing list
>> [email protected]
>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>
>>
>>
>
>
> --
> *V. Mohanadarshan*
> *Software Engineer,*
> *Data Technologies Team,*
> *WSO2, Inc. http://wso2.com <http://wso2.com> *
> *lean.enterprise.middleware.*
>
> email: [email protected]
> phone:(+94) 771117673
>
> _______________________________________________
> Architecture mailing list
> [email protected]
> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>
>

_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture

Re: [Architecture] Using Siddhi Event processor to implement/evaluate some clustering algorithms

Reply via email to