Re: [Architecture] Siddhi: K-means Clustering extension

Sachini Siriwardene Fri, 09 Jun 2017 04:19:01 -0700

Further implementation details of the extension:

The extension is implemented by extending the stream processor.


The input parameters for the function:

   1. data point to be clustered
   2. number of cluster centres (k)
   3. number of iterations
   4. number of data points for which the model is trained (x)
   5. continueToTrain (boolean)



Output received:

   1. cluster centre value to which the data point belongs
   2. id of the particular cluster centre
   3. distance from the cluster centre



We can use the clustering extension with a given window. The processing
details are as follows:

   1. For each current event in the window, the data point received is
      added to an ArrayList.
   2. If the number of data points received is greater than x, the cluster
      centre to which the data point belongs is calculated and an output
      is produced.
   3. If the number of data points received is a multiple of x, the data in
      the ArrayList is sent to be clustered.
   4. If an expired event is received, the first item in the ArrayList is
      removed.
   5. If a reset event is received, all the data in the ArrayList is
      removed.
   6. If the continueToTrain parameter is false, the model is not retrained
      after every x events. Instead, it is trained only on the first x
      events, and the computed centres are used to produce the output for
      every event received afterwards.
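
As a rough illustration of the event handling listed above, here is a
minimal Java sketch. It is not the extension's code: it assumes
one-dimensional double values, and the class ClusteringWindowSketch, the
EventType enum, the onEvent method and the pluggable trainer function are
hypothetical names used only for this example. It applies the output check
before the training check, following the order of the list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Hypothetical sketch of the per-event handling; not the real Siddhi API. */
final class ClusteringWindowSketch {
    enum EventType { CURRENT, EXPIRED, RESET }

    private final List<Double> window = new ArrayList<>();
    private final int x;                    // events per training round
    private final boolean continueToTrain;  // retrain on every multiple of x?
    private final Function<List<Double>, double[]> trainer; // assumed k-means trainer
    private long received = 0;
    private double[] centres = null;

    ClusteringWindowSketch(int x, boolean continueToTrain,
                           Function<List<Double>, double[]> trainer) {
        if (x <= 0) throw new IllegalArgumentException("x must be positive");
        this.x = x;
        this.continueToTrain = continueToTrain;
        this.trainer = trainer;
    }

    /** Returns {centre value, centre id, distance}, or null if no output is produced. */
    double[] onEvent(EventType type, double value) {
        if (type == EventType.EXPIRED) {
            if (!window.isEmpty()) window.remove(0); // drop the oldest data point
            return null;
        }
        if (type == EventType.RESET) {
            window.clear();
            return null;
        }
        window.add(value);
        received++;

        // Output only once more than x data points have been received.
        double[] output = null;
        if (received > x && centres != null && centres.length > 0) {
            int best = 0;
            for (int c = 1; c < centres.length; c++) {
                if (Math.abs(value - centres[c]) < Math.abs(value - centres[best])) best = c;
            }
            output = new double[] {centres[best], best, Math.abs(value - centres[best])};
        }

        // Train on every multiple of x, or only on the first x if continueToTrain is false.
        if (received % x == 0 && (received == x || continueToTrain)) {
            centres = trainer.apply(new ArrayList<>(window));
        }
        return output;
    }
}
```

With x = 3 and continueToTrain = false, the first three events produce no
output, the model is trained once after the third event, and every later
event is matched against those same centres.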

The training process includes:

   1. Initializing the cluster centres from the distinct values among the
      first k data points in the data set. If there are fewer distinct data
      points than k, the number of cluster centres is set to the number of
      distinct data points.
   2. Assigning the data points in the data set to the available cluster
      centres.
   3. Computing the new centre of each cluster by taking the average value
      of the data assigned to it.
   4. Reassigning the values in the data set and recomputing the cluster
      centres until the cluster centre values do not change or the number
      of iterations is reached.
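
The training steps above can be sketched in Java roughly as follows. This
is only an illustration under stated assumptions, not the extension's
implementation: data points are taken to be one-dimensional doubles,
distance is the absolute difference, the seeds are the first k distinct
values found while scanning the data set, and the names KMeansSketch, train
and nearest are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of deterministic 1-D k-means training. */
final class KMeansSketch {

    /** Returns the trained centres; fewer than k if the data has fewer distinct values. */
    static double[] train(List<Double> data, int k, int maxIterations) {
        // Step 1: seed the centres with the first k distinct values (assumption).
        List<Double> seeds = new ArrayList<>();
        for (double v : data) {
            if (seeds.size() == k) break;
            if (!seeds.contains(v)) seeds.add(v);
        }
        double[] centres = new double[seeds.size()];
        for (int i = 0; i < centres.length; i++) centres[i] = seeds.get(i);

        for (int iter = 0; iter < maxIterations; iter++) {
            double[] sum = new double[centres.length];
            int[] count = new int[centres.length];
            // Step 2: assign every data point to its nearest centre.
            for (double v : data) {
                int c = nearest(centres, v);
                sum[c] += v;
                count[c]++;
            }
            // Step 3: move each centre to the average of its assigned points.
            double[] next = centres.clone();
            for (int c = 0; c < centres.length; c++) {
                if (count[c] > 0) next[c] = sum[c] / count[c];
            }
            // Step 4: stop early once the centres no longer change.
            if (Arrays.equals(next, centres)) break;
            centres = next;
        }
        return centres;
    }

    /** Index of the centre closest to the value; ties go to the lower index. */
    static int nearest(double[] centres, double value) {
        int best = 0;
        for (int c = 1; c < centres.length; c++) {
            if (Math.abs(value - centres[c]) < Math.abs(value - centres[best])) best = c;
        }
        return best;
    }
}
```

Because the seeds depend only on the order of the input, running train
twice on the same list gives the same centres, which is the point of this
variant.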


On Fri, Jun 9, 2017 at 12:40 PM, Malith Jayasinghe <[email protected]> wrote:

> Should we add an option to enable/disable continuous learning? If "on",
> training will happen after every x events; otherwise, only after the first
> x events.
>
> On Fri, Jun 9, 2017 at 11:04 AM, Sachini Siriwardene <[email protected]>
> wrote:
>
>> Hi Fazlan,
>> Yes, that is what happens.
>>
>> On Fri, Jun 9, 2017 at 10:52 AM, Fazlan Nazeem <[email protected]> wrote:
>>
>>> Hi Sachini,
>>>
>>> Okay. I think I misread the "every x events" part previously. This
>>> means that if x is 100, then when 200 events have been received we would
>>> have 2 models in total. +1 if that is the case.
>>>
>>>
>>> On Fri, Jun 9, 2017 at 9:56 AM, Malith Jayasinghe <[email protected]>
>>> wrote:
>>>
>>>> adding Fazlan
>>>>
>>>> On Fri, Jun 9, 2017 at 9:54 AM, Sachini Siriwardene <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi Fazlan,
>>>>> Please find my replies inline.
>>>>>
>>>>> On Wed, Jun 7, 2017 at 3:48 PM, Fazlan Nazeem <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi Malith,
>>>>>>
>>>>>>
>>>>>> On Wed, Jun 7, 2017 at 3:04 PM, Malith Jayasinghe <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hello All,
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> We are developing a k-means clustering extension. k-means is an
>>>>>>> unsupervised learning algorithm which provides a simple way to
>>>>>>> partition a given data set into a certain number of clusters. The
>>>>>>> standard k-means clustering algorithm is nondeterministic: we can
>>>>>>> get different results for the same input data when we run the
>>>>>>> algorithm multiple times, because it randomly chooses k observations
>>>>>>> from the data set and uses these as the initial means. Here we
>>>>>>> implement a variant of k-means in which the initial cluster centers
>>>>>>> are determined by the first k distinct values. This ensures the same
>>>>>>> output for a given input.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Function Parameters:
>>>>>>>
>>>>>>> Data point to be clustered
>>>>>>>
>>>>>>> Number of cluster centers - k
>>>>>>>
>>>>>>> Number of iterations - m
>>>>>>>
>>>>>>> Number of events for which the model is trained - x
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> The cluster centers are initialized from the first k distinct events
>>>>>>> in the stream, where k is the number of cluster centers.
>>>>>>>
>>>>>>> The model is trained for every x events received.
>>>>>>>
>>>>>>
>>>>>> Does this mean at any point in time, the maximum number of input
>>>>>> points used by the training process is x? Also, how is the training
>>>>>> process carried out? I assume the training doesn't happen in real time.
>>>>>>
>>>>>
>>>>> Training is carried out on the data points accumulated so far, which
>>>>> depends on the window used. The data is collected over the given window
>>>>> by updating an array list.
>>>>>
>>>>> Once an event expires from the window, an element is removed from
>>>>> the array list.
>>>>>
>>>>> For every x data points received, the data accumulated in the array
>>>>> list is sent to be clustered and new cluster centers are computed. The
>>>>> training is carried out in real time, on the data available in the
>>>>> array list at the time it is sent for clustering.
>>>>>
>>>>> The training process includes:
>>>>>
>>>>>    1. Initializing the cluster centers from the distinct values among
>>>>>       the first k data points in the data set. If there are fewer
>>>>>       distinct data points than k, the number of cluster centers is
>>>>>       set to the number of distinct data points.
>>>>>    2. Assigning the data points in the data set to the available
>>>>>       cluster centers.
>>>>>    3. Computing the new center of each cluster by taking the average
>>>>>       value of the data assigned to it.
>>>>>    4. Reassigning the values in the data set and recomputing the
>>>>>       cluster centers until the cluster center values do not change
>>>>>       or the number of iterations is reached.
>>>>>
>>>>>
>>>>>
>>>>> An option can be given to train the model for only the first x number
>>>>> of events or train it for each x data points received.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>>> After receiving the first x events, an output is given for each event
>>>>>>> generated. The output consists of the cluster centre value to which the
>>>>>>> data point belongs, the id of the particular cluster center and the
>>>>>>> distance from the cluster center.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> The clustering can be performed for a given window implementation,
>>>>>>> i.e. time, time batch, or length.
>>>>>>>
>>>>>>> --
>>>>>>> Malith Jayasinghe
>>>>>>>
>>>>>>> WSO2, Inc. (http://wso2.com)
>>>>>>> Email   :[email protected]
>>>>>>> Mobile :0770704040
>>>>>>> Blog     :https://medium.com/@malith.jayasinghe
>>>>>>> Lean . Enterprise . Middleware
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Architecture mailing list
>>>>>>> [email protected]
>>>>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Thanks & Regards,
>>>>>>
>>>>>> Fazlan Nazeem
>>>>>>
>>>>>> *Senior Software Engineer*
>>>>>>
>>>>>> *WSO2 Inc*
>>>>>> Mobile : +94772338839
>>>>>> [email protected]
>>>>>
>>>>>
>>>>> --
>>>>> Sachini Siriwardene
>>>>> Software Engineering Intern
>>>>>
>>>>> +94774274374



-- 
Sachini Siriwardene
Software Engineering Intern

+94774274374

_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
