[
https://issues.apache.org/jira/browse/FLINK-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14608437#comment-14608437
]
ASF GitHub Bot commented on FLINK-2131:
---------------------------------------
GitHub user sachingoel0101 reopened a pull request:
https://github.com/apache/flink/pull/757
[FLINK-2131][ml]: Initialization schemes for k-means clustering
This adds the two most common initialization strategies for the k-means
clustering algorithm: random initialization and kmeans++ initialization.
Further details are at https://issues.apache.org/jira/browse/FLINK-2131
[Edit]: Work on kmeans|| has been started and just needs to be finalized.
[Edit]: kmeans|| implementation finished.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/sachingoel0101/flink clustering_initializations
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/flink/pull/757.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #757
----
commit dc2de88bf5e3148bb116cad607fc3c61d9dceac6
Author: Sachin Goel <[email protected]>
Date: 2015-06-02T06:44:30Z
Random and kmeans++ initialization methods added
commit 4a39a19c1425259c71ac6d922b4d9a9f2e7d1c6e
Author: Sachin Goel <[email protected]>
Date: 2015-06-02T15:42:58Z
Merge https://github.com/apache/flink into clustering_initializations
commit cdbb3a0801d364935d455798c695f4615ae74e76
Author: Sachin Goel <[email protected]>
Date: 2015-06-02T19:49:24Z
Merge https://github.com/apache/flink into clustering_initializations
commit 7496e21462e4efc0813450971ae6cbc94d2b2c15
Author: Sachin Goel <[email protected]>
Date: 2015-06-02T22:41:20Z
Initialization costs of random and kmeans++ added
commit 8033c87b71686bd3955281db12583592549406cb
Author: Sachin Goel <[email protected]>
Date: 2015-06-05T21:54:10Z
Merge https://github.com/apache/flink into clustering_initializations
commit 29ed1d3fb31aa038d6ed1a5bf16d58f19565cdf8
Author: Sachin Goel <[email protected]>
Date: 2015-06-05T22:52:02Z
Removed cost parameter from Algorithm itself. Leaving it to the user for
now. Also added support for weighted input data sets
commit 5286c3c21d5019f6ba8ab67c2074570087bc1b3a
Author: Sachin Goel <[email protected]>
Date: 2015-06-06T05:04:55Z
An initial draft of kmeans-par method
commit f3bfad4fc0c6576af14f1e981f8e778445856355
Author: Sachin Goel <[email protected]>
Date: 2015-06-08T10:36:32Z
All three initialization schemes implemented and tested
commit 8496b8fd627ade8dbe7b92949d35d3cce704f1cc
Author: Sachin Goel <[email protected]>
Date: 2015-06-08T10:36:58Z
Merge https://github.com/apache/flink into clustering_initializations
commit 3765a3e6a77a8bdbac21d03be1c43263925b1495
Author: Sachin Goel <[email protected]>
Date: 2015-06-30T08:57:41Z
Merge remote-tracking branch 'upstream/master' into clustering_initializations
----
> Add Initialization schemes for K-means clustering
> -------------------------------------------------
>
> Key: FLINK-2131
> URL: https://issues.apache.org/jira/browse/FLINK-2131
> Project: Flink
> Issue Type: Task
> Components: Machine Learning Library
> Reporter: Sachin Goel
> Assignee: Sachin Goel
>
> Lloyd's [KMeans] algorithm takes initial centroids as its input. However,
> if the user doesn't provide the initial centers, they may ask for a
> particular initialization scheme to be followed. The most commonly used are
> these:
> 1. Random initialization: Self-explanatory
> 2. kmeans++ initialization: http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf
> 3. kmeans|| : http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf
> For very large data sets, or for large values of k, the kmeans|| method is
> preferred, as it provides the same approximation guarantees as kmeans++ while
> requiring fewer passes over the input data.
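
The first two schemes listed in the issue description can be sketched roughly as follows. This is a minimal single-machine Python illustration, not the Flink ML code from the pull request: the function names (`random_init`, `kmeans_pp_init`, `cost`), the tuple representation of points, and the uniform fallback for degenerate inputs are all assumptions made for the example. kmeans++ picks the first center uniformly at random and every further center with probability proportional to its squared distance from the nearest center chosen so far.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def random_init(points, k, rng):
    """Random initialization: pick k distinct input points uniformly."""
    return rng.sample(points, k)


def kmeans_pp_init(points, k, rng):
    """kmeans++ seeding (hypothetical helper, not the Flink API)."""
    # First center: uniform over the input.
    centers = [rng.choice(points)]
    # d2[i] is the squared distance from points[i] to its nearest center.
    d2 = [squared_distance(p, centers[0]) for p in points]
    while len(centers) < k:
        if sum(d2) == 0:
            # Assumption: if every point coincides with a chosen center,
            # fall back to a uniform pick instead of failing.
            centers.append(rng.choice(points))
        else:
            # Sample proportionally to the squared distance (D^2 weighting).
            centers.append(rng.choices(points, weights=d2, k=1)[0])
        newest = centers[-1]
        # Only the newest center can shrink a point's nearest distance.
        d2 = [min(d, squared_distance(p, newest)) for d, p in zip(d2, points)]
    return centers


def cost(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    return sum(min(squared_distance(p, c) for c in centers) for p in points)
```

Calling `kmeans_pp_init(points, k, random.Random(seed))` on a list of coordinate tuples returns k of the input points as initial centroids, and `cost` can be used to compare the result against `random_init`. Each new center needs one pass over the data, so kmeans++ makes k passes in total; this is the overhead that kmeans|| reduces by sampling many candidates per pass.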