[ 
https://issues.apache.org/jira/browse/FLINK-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14608437#comment-14608437
 ] 

ASF GitHub Bot commented on FLINK-2131:
---------------------------------------

GitHub user sachingoel0101 reopened a pull request:

    https://github.com/apache/flink/pull/757

    [FLINK-2131][ml]: Initialization schemes for k-means clustering

    This adds the two most common initialization strategies for the k-means
    clustering algorithm, namely random initialization and kmeans++ initialization.
    Further details are at https://issues.apache.org/jira/browse/FLINK-2131
    [Edit]: Work on kmeans|| has been started and just needs to be finalized.
    [Edit]: kmeans|| implementation finished. 

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sachingoel0101/flink clustering_initializations

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/757.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #757
    
----
commit dc2de88bf5e3148bb116cad607fc3c61d9dceac6
Author: Sachin Goel <[email protected]>
Date:   2015-06-02T06:44:30Z

    Random and kmeans++ initialization methods added

commit 4a39a19c1425259c71ac6d922b4d9a9f2e7d1c6e
Author: Sachin Goel <[email protected]>
Date:   2015-06-02T15:42:58Z

    Merge https://github.com/apache/flink into clustering_initializations

commit cdbb3a0801d364935d455798c695f4615ae74e76
Author: Sachin Goel <[email protected]>
Date:   2015-06-02T19:49:24Z

    Merge https://github.com/apache/flink into clustering_initializations

commit 7496e21462e4efc0813450971ae6cbc94d2b2c15
Author: Sachin Goel <[email protected]>
Date:   2015-06-02T22:41:20Z

    Initialization costs of random and kmeans++ added

commit 8033c87b71686bd3955281db12583592549406cb
Author: Sachin Goel <[email protected]>
Date:   2015-06-05T21:54:10Z

    Merge https://github.com/apache/flink into clustering_initializations

commit 29ed1d3fb31aa038d6ed1a5bf16d58f19565cdf8
Author: Sachin Goel <[email protected]>
Date:   2015-06-05T22:52:02Z

    Removed cost parameter from Algorithm itself. Leaving it to the user for
    now. Also added support for weighted input data sets

commit 5286c3c21d5019f6ba8ab67c2074570087bc1b3a
Author: Sachin Goel <[email protected]>
Date:   2015-06-06T05:04:55Z

    An initial draft of kmeans-par method

commit f3bfad4fc0c6576af14f1e981f8e778445856355
Author: Sachin Goel <[email protected]>
Date:   2015-06-08T10:36:32Z

    All three initialization schemes implemented and tested

commit 8496b8fd627ade8dbe7b92949d35d3cce704f1cc
Author: Sachin Goel <[email protected]>
Date:   2015-06-08T10:36:58Z

    Merge https://github.com/apache/flink into clustering_initializations

commit 3765a3e6a77a8bdbac21d03be1c43263925b1495
Author: Sachin Goel <[email protected]>
Date:   2015-06-30T08:57:41Z

    Merge remote-tracking branch 'upstream/master' into clustering_initializations

----


> Add Initialization schemes for K-means clustering
> -------------------------------------------------
>
>                 Key: FLINK-2131
>                 URL: https://issues.apache.org/jira/browse/FLINK-2131
>             Project: Flink
>          Issue Type: Task
>          Components: Machine Learning Library
>            Reporter: Sachin Goel
>            Assignee: Sachin Goel
>
> Lloyd's [KMeans] algorithm takes initial centroids as its input. However,
> if the user doesn't provide the initial centers, they may ask for a
> particular initialization scheme to be followed. The most commonly used are
> these:
> 1. Random initialization: Self-explanatory
> 2. kmeans++ initialization: http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf
> 3. kmeans|| : http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf
> For very large data sets, or for large values of k, the kmeans|| method is
> preferred as it provides the same approximation guarantees as kmeans++ and
> requires fewer passes over the input data.
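
For readers unfamiliar with scheme 2, the following is a minimal single-machine
sketch of kmeans++ seeding in Python. It is not the code from the pull request:
the names kmeans_pp_init and squared_distance are hypothetical, points are
assumed to be tuples of numbers, and the uniform fallback for the case where
every point coincides with a chosen center is an assumption of this sketch.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(points, k, rng=None):
    """Pick k initial centers from `points` by kmeans++ style seeding.

    Hypothetical helper for illustration only; not the Flink implementation.
    """
    rng = rng or random.Random()
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")

    # The first center is drawn uniformly at random.
    centers = [points[rng.randrange(len(points))]]
    # d2[i] holds the squared distance from points[i] to its nearest center.
    d2 = [squared_distance(p, centers[0]) for p in points]

    while len(centers) < k:
        if sum(d2) == 0:
            # Every point coincides with a chosen center (duplicate input);
            # falling back to a uniform draw is an assumption of this sketch.
            idx = rng.randrange(len(points))
        else:
            # Draw the next center with probability proportional to d2.
            idx = rng.choices(range(len(points)), weights=d2, k=1)[0]
        centers.append(points[idx])
        # Refresh nearest-center distances against the newly added center.
        d2 = [min(d, squared_distance(p, points[idx]))
              for d, p in zip(d2, points)]
    return centers
```

Calling kmeans_pp_init(points, k) returns k points taken from the input, which
can then be handed to Lloyd's algorithm as its initial centroids. Each
additional center costs one pass over the data, which is the cost that
kmeans|| is meant to reduce.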



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
