[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14540465#comment-14540465
 ] 

Peter Schrott commented on FLINK-1731:
--------------------------------------

Very nice. The implementation of BreezeVector and EuclideanDistanceMetrics 
works out just fine. Thanks for the support on that.

There are two more open questions:
1) How should the initial centroids be passed to the algorithm? We implemented 
KMeans as a derivative of Learner. As there is only one argument to pass 
(the data set), should we set the initial centroids as a parameter? (We do 
the same for the number of iterations.)
2) Should the initial centroids be passed as a DataSet or as a Seq? Are there 
any side effects regarding parallelism when using the DataSet type? 
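
To make the "centroids as a parameter" option concrete, here is a minimal 
sketch in plain Java with no Flink dependency. Everything in it is 
hypothetical and only for illustration: Param, ParameterMap, 
INITIAL_CENTROIDS and NUM_ITERATIONS are made-up names, not the flink-ml 
API, and a local List stands in for the Seq variant. With a DataSet the 
centroids would presumably have to reach the parallel tasks some other way, 
for example as a broadcast set.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch only; none of these types are the flink-ml API. */
public class KMeansSketch {

    /** A typed parameter key with a default value (assumed design). */
    public static final class Param<T> {
        public final String name;
        public final T defaultValue;

        public Param(String name, T defaultValue) {
            this.name = name;
            this.defaultValue = defaultValue;
        }
    }

    /** A minimal parameter map; falls back to the key's default value. */
    public static final class ParameterMap {
        private final Map<Param<?>, Object> values = new HashMap<>();

        public <T> ParameterMap set(Param<T> key, T value) {
            values.put(key, value);
            return this;
        }

        @SuppressWarnings("unchecked")
        public <T> T get(Param<T> key) {
            return values.containsKey(key) ? (T) values.get(key) : key.defaultValue;
        }
    }

    // Both settings travel the same way: as entries in the parameter map.
    public static final Param<Integer> NUM_ITERATIONS =
        new Param<>("numIterations", 10);
    public static final Param<List<double[]>> INITIAL_CENTROIDS =
        new Param<>("initialCentroids", null);

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }

    /** The data set is the only positional argument; the rest is parameters. */
    public static List<double[]> fit(List<double[]> points, ParameterMap params) {
        List<double[]> initial = params.get(INITIAL_CENTROIDS);
        if (initial == null || initial.isEmpty()) {
            throw new IllegalArgumentException("initial centroids must be set");
        }
        // Copy so that the caller's centroids are not modified.
        List<double[]> centroids = new ArrayList<>();
        for (double[] c : initial) {
            centroids.add(c.clone());
        }
        int iterations = params.get(NUM_ITERATIONS);
        int dim = centroids.get(0).length;

        for (int it = 0; it < iterations; it++) {
            double[][] sums = new double[centroids.size()][dim];
            int[] counts = new int[centroids.size()];
            // Assignment step: each point goes to its nearest centroid.
            for (double[] p : points) {
                int best = 0;
                for (int k = 1; k < centroids.size(); k++) {
                    if (squaredDistance(p, centroids.get(k))
                            < squaredDistance(p, centroids.get(best))) {
                        best = k;
                    }
                }
                counts[best]++;
                for (int d = 0; d < dim; d++) {
                    sums[best][d] += p[d];
                }
            }
            // Update step: move each centroid to the mean of its points.
            for (int k = 0; k < centroids.size(); k++) {
                if (counts[k] == 0) {
                    continue; // an empty cluster keeps its previous centroid
                }
                for (int d = 0; d < dim; d++) {
                    centroids.get(k)[d] = sums[k][d] / counts[k];
                }
            }
        }
        return centroids;
    }
}
```

In this shape the caller writes something like 
new ParameterMap().set(INITIAL_CENTROIDS, seeds).set(NUM_ITERATIONS, 5) and 
passes only the points as the positional argument, which is the same pattern 
we already use for the number of iterations.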

> Add kMeans clustering algorithm to machine learning library
> -----------------------------------------------------------
>
>                 Key: FLINK-1731
>                 URL: https://issues.apache.org/jira/browse/FLINK-1731
>             Project: Flink
>          Issue Type: New Feature
>          Components: Machine Learning Library
>            Reporter: Till Rohrmann
>            Assignee: Alexander Alexandrov
>              Labels: ML
>
> The Flink repository already contains a kMeans implementation but it is not 
> yet ported to the machine learning library. I assume that only the used data 
> types have to be adapted and then it can be more or less directly moved to 
> flink-ml.
> The kMeans++ [1] and the kMeans|| [2] algorithms constitute a better 
> implementation because they improve the initial seeding phase to achieve 
> near-optimal clustering. It might be worthwhile to implement kMeans||.
> Resources:
> [1] http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf
> [2] http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
