Hi Lahiru,

As Srinath has mentioned as well, frequency counting algorithms are very
useful in financial market scenarios as well (especially fraud detection
and surveillance).
Thanks for doing this and I will take a look too.

seshika


On Fri, Jun 27, 2014 at 10:52 AM, Mohanadarshan Vivekanandalingam <
[email protected]> wrote:

>
>
>
> On Fri, Jun 27, 2014 at 10:00 AM, Lahiru Gunathilake <[email protected]
> > wrote:
>
>> Hi Mohan,
>>
>
> Hi Lahiru,
>
>
>>
>> I wrote some samples but I can write more test-cases and provide another
>> patch. Please feel free to change the naming of the windows as you like.
>>
>
> Really appreciate your contribution.. Sure, I'll start look into this..
>
> Thanks,
> Mohan
>
>
>> Regards
>> Lahiru
>>
>> On Jun 26, 2014, at 11:35 PM, Srinath Perera wrote:
>>
>> Hi All,
>>
>> Lahiru and myself had a call today morning.
>>
>> Plan is to
>> 1) Lahiru to look at hoeffding tree and other classification algorithms
>> and select one to implement. He will compare the performance against MOA or
>> some other implementation.
>> 2) then we will use it for a  Fraud analysis scenario as a proof of its
>> validity.
>>
>> Then we will decide how to continue from that point.
>>
>> Mohan could you look at the patch? Lahiru will write test cases that you
>> can use to verify.
>>
>> --Srinath
>>
>>
>> On Tue, Jun 24, 2014 at 4:56 AM, Lahiru Gunathilake <[email protected]
>> > wrote:
>>
>>> Hi Srinath,
>>>
>>> Thanks for the response.
>>> On Jun 23, 2014, at 10:48 PM, Srinath Perera wrote:
>>>
>>> Hi Lahiru,
>>>
>>> Sorry for not responding earlier. I was traveling last week.
>>>
>>> I guess you know frequent item set !=  clustering algorithms.
>>>
>>> +1
>>>
>>>
>>> Can we have a chat sometime to discuss details about the implementation.
>>>
>>> Yes sure, that would be very useful.
>>>
>>> Thanks
>>> Lahiru
>>>
>>>
>>> --Srinath
>>>
>>>
>>> On Sun, Jun 22, 2014 at 6:25 PM, Lahiru Gunathilake <
>>> [email protected]> wrote:
>>>
>>>> Hi All,
>>>>
>>>> I have implemented another frequency counting algorithm[1] which is a
>>>> classic algorithm for mining frequent items in a data stream. Basically
>>>> users can specify a minimum average value of frequent items with an error
>>>> value.
>>>>
>>>> This algorithm will accept two user-specified parameters: a support
>>>> threshold s [0-1] and an error parameter e [0-1] such that e << s and
>>>> recommended number for e is normally s/10 or s/20. Let N denote the
>>>> current length
>>>> of the stream, i.e., the number of tuples seen so far. At any point of
>>>> time, this algorithm can be asked to produce a list of events along
>>>> with their estimated frequencies. The answers produced by this
>>>> algorithm will have the
>>>> following guarantees:
>>>> 1. All item(set)s whose true frequency exceeds sN are outputs.
>>>>  There are no false negatives .
>>>> 2. No events whose true frequency is less than (s - e) N
>>>> is output.
>>>>  3. Estimated frequencies are less  than the true frequencies
>>>> by at most eN.
>>>>  .
>>>> The incoming stream is conceptually divided in to buckets of width w =
>>>> ceiling(1/e) transactions each. Buckets are labeled with bucket ids ,
>>>> starting from 1.We denote the current bucket id  by bcurrent whose
>>>> value is ceiling(N/w). For an element , we denote its true frequency
>>>> in the stream seen so far by "fe"  . Note that e and w are fixed for a
>>>> data stream while N, bcurrent and fe are the variables whose value changes
>>>> when the stream progress.
>>>> Here the data structure ,  is a set of entries are tuples with the
>>>> form of (event, f, delta), where event is the actual event and "f" is
>>>> the  is an integer representing its estimated frequency, and delta is
>>>> the maximum possible error in "fe".
>>>>
>>>>  In this algorithm every new event will anyways added in to the
>>>> data-structure optimistically and there is no initial condition to enter in
>>>> to the data-structure but iteratively if certain events are not match to
>>>> the condition provided
>>>> by the user those events will be removed when N%windowSize = 0, during
>>>> this I output those events as expired-events in the window. For the
>>>> incoming events if event is already exists I output those as current-events
>>>> and if its a new
>>>> event I just add it to the data-structure and iterate through all the
>>>> events available and find the matching events based on s and e values and
>>>> only those events will be output.
>>>>
>>>> I am not sure based on the window definition this is the correct
>>>> approach or may be windows aren't the best way to associate this algorithm
>>>> in to siddhi. I have attached my patch to jira[2] and if you can look that
>>>> would be great.
>>>>
>>>> Siddhi Query will looks like below,
>>>> from  cseEventStream#window.lossyFrequent(0.1,0.01) " +
>>>>                                                        "select symbol,
>>>> price " +
>>>>                                                        "insert into
>>>> StockQuote;
>>>>
>>>>
>>>> [1]Gurmeet Singh Manku and Rajeev Motwani. 2002. Approximate frequency
>>>> counts over data streams. In Proceedings of the 28th international
>>>> conference on Very Large Data Bases (VLDB '02). VLDB Endowment 346-357.
>>>>
>>>> Regards
>>>> Lahiru
>>>> On Jun 19, 2014, at 2:36 AM, Lahiru Gunathilake wrote:
>>>>
>>>> Hi All,
>>>> On Jun 17, 2014, at 1:49 AM, Lahiru Gunathilake wrote:
>>>>
>>>> Hi All,
>>>>
>>>> I am planning to evaluate different event stream clustering algorithms
>>>> as part of my studies(I am a graduate student at indiana University). I
>>>> think Siddhi is a good place to experiment this, As per my understanding
>>>> based on the docs Siddhi doesn't have a stream clustering interface I can
>>>> use directly to plug my own algorithm. So I am thinking of first come up an
>>>> interface for different clustering algorithms and add implementation of
>>>> algorithms for each event stream by invoking an operation like
>>>> SiddhiManager.addQuery. Or I can make the algorithm configure as part of
>>>> query language. If the second option is more consistent with current model
>>>> I can wrap-up the work in that way but initially focussing on first
>>>> approach will be easier for me. So each algorithm can be associated to a
>>>> desired event Stream or can be associated globally. If its associated with
>>>> each stream algorithm will run local to each stream otherwise it will run
>>>> in global context. Based on the algorithm I can provide a way to configure
>>>> it with parameters.
>>>>
>>>> I am sure I have confused with above implementation details, after
>>>> looking in to Siddhi extension points I figured out I just have to
>>>> implement a new window type. I have implemented one algorithm to keep the
>>>> most frequent events
>>>> came in a event stream. So queries can looks like below,
>>>>
>>>> from  cseEventStream#window.frequent(2) " +
>>>>                                                        "select symbol,
>>>> price " +
>>>>                                                        "insert into
>>>> StockQuote;
>>>>
>>>> There are multiple algorithms to keep the most frequent events in a
>>>> given window size for now I just implemented a simple algorithm[1] with the
>>>> processing complexity of O(1) and space complexity O(n) where n is the
>>>> limit of the most frequent items. I have created a patch and attached it to
>>>> jira[2].
>>>>
>>>> [1] Jayadev, and David Gries Misra, "Finding repeated elements," in
>>>> Science of computer programming 2, no.
>>>> 2 (1982): 143-152.
>>>> [2]https://wso2.org/jira/browse/CEP-877
>>>>
>>>>
>>>> Thanks
>>>> Lahiru
>>>>
>>>> To start this I hope to implement a frequent item set mining algorithm
>>>> which can be used to find out most frequent items of an event stream.
>>>> Search engines use these kind of data to find out most frequent searches in
>>>> a given time window and optimize the search queries. I can start with some
>>>> algorithms like Misra-Gries algorithm[1] and Manku and Motwani [2] and
>>>> then move towards more of data clustering algorithms. For the time being I
>>>> will write the clustering results in to a file and later I think I can use
>>>> more stable storage (either wso2 registry or other prefered way in wso2
>>>> product stack). If Siddhi or WSO2 CEP already have the capability of
>>>> frequent item mining I will start with a more classification type 
>>>> algorithm.
>>>>
>>>> Your feedback will be very useful for my work. If you have requirement
>>>> for any specific type of algorithms based on the real client interactions
>>>> you have, I would like to know them and implement them with Siddhi and do
>>>> the comparison.
>>>>
>>>> Thanks
>>>> Lahiru
>>>> _______________________________________________
>>>> Architecture mailing list
>>>> [email protected]
>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>>>
>>>>
>>>> _______________________________________________
>>>> Architecture mailing list
>>>> [email protected]
>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> Architecture mailing list
>>>> [email protected]
>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>>>
>>>>
>>>
>>>
>>> --
>>> ============================
>>> Srinath Perera, Ph.D.
>>>    http://people.apache.org/~hemapani/
>>>    http://srinathsview.blogspot.com/
>>>  _______________________________________________
>>> Architecture mailing list
>>> [email protected]
>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>>
>>>
>>>
>>> _______________________________________________
>>> Architecture mailing list
>>> [email protected]
>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>>
>>>
>>
>>
>> --
>> ============================
>> Srinath Perera, Ph.D.
>>    http://people.apache.org/~hemapani/
>>    http://srinathsview.blogspot.com/
>>  _______________________________________________
>> Architecture mailing list
>> [email protected]
>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>
>>
>>
>
>
> --
> *V. Mohanadarshan*
> *Software Engineer,*
> *Data Technologies Team,*
> *WSO2, Inc. http://wso2.com <http://wso2.com> *
> *lean.enterprise.middleware.*
>
> email: [email protected]
> phone:(+94) 771117673
>
> _______________________________________________
> Architecture mailing list
> [email protected]
> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>
>
_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture

Reply via email to