[
https://issues.apache.org/jira/browse/FLINK-3245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15104984#comment-15104984
]
Till Rohrmann commented on FLINK-3245:
--------------------------------------
I agree with [~fhueske] about what the k-means algorithm does, but I think Kay's
question arises from a strange convention on our side.
The k-means data generator generates a sample of points drawn from multiple
Gaussian distributions. The means of the Gaussian distributions are sampled
from a uniform distribution over the specified range. Additionally, the program
generates another file called {{centers}}. This file does not contain, as one
would assume according to the file's name, the original centers of the Gaussian
distributions, but newly sampled points from a uniform distribution. These
points are then used by the k-means implementation to initialize the
computation.
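To make the current behaviour concrete, here is a minimal sketch of it. This is not the code of KMeansDataGenerator; the class name, method names, parameter values and the way a mean is picked for each point are assumptions made for illustration only.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical sketch of the behaviour described above, not Flink code.
final class GeneratorSketch {

    /** Samples k points uniformly from [min, max) in every dimension. */
    public static double[][] uniformPoints(Random rnd, int k, int dim, double min, double max) {
        double[][] pts = new double[k][dim];
        for (double[] p : pts) {
            for (int d = 0; d < dim; d++) {
                p[d] = min + rnd.nextDouble() * (max - min);
            }
        }
        return pts;
    }

    /** Draws n points, each from a Gaussian around one of the given means. */
    public static double[][] gaussianPoints(Random rnd, double[][] means, int n, double stddev) {
        int dim = means[0].length;
        double[][] pts = new double[n][dim];
        for (int i = 0; i < n; i++) {
            // Assumption: the mean is chosen at random for every point.
            double[] mean = means[rnd.nextInt(means.length)];
            for (int d = 0; d < dim; d++) {
                pts[i][d] = mean[d] + stddev * rnd.nextGaussian();
            }
        }
        return pts;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42L);
        // Means of the Gaussians, sampled uniformly over the range.
        double[][] trueMeans = uniformPoints(rnd, 3, 2, -100.0, 100.0);
        double[][] points = gaussianPoints(rnd, trueMeans, 1000, 5.0);
        // The confusing part: the "centers" output is a second, unrelated uniform sample.
        double[][] centersFile = uniformPoints(rnd, 3, 2, -100.0, 100.0);

        System.out.println("generated points:  " + points.length);
        System.out.println("true means:        " + Arrays.deepToString(trueMeans));
        System.out.println("written 'centers': " + Arrays.deepToString(centersFile));
    }
}
```

Running the sketch prints two unrelated sets of coordinates, which is exactly the mismatch Kay noticed.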
I think that this does not make much sense and is confusing. Instead, the
initial centroids should be drawn from the input data set (the sampled
points) and not from a dedicated centers file that has nothing to do with the
actual distribution. Furthermore, I think it would make sense to store the
original centers in the {{centers}} file so that one can compare the computed
centers to them.
> KMeans Data Generator writes not the same centroids as it was used for the
> dataset
> ----------------------------------------------------------------------------------
>
> Key: FLINK-3245
> URL: https://issues.apache.org/jira/browse/FLINK-3245
> Project: Flink
> Issue Type: Bug
> Reporter: Kay
> Priority: Trivial
>
> Hey guys.
> I am using your really nice KMeans dataset generator. I am wondering what
> actually is the reason you write out not the same centers as the data
> generator has used for the generated dataset.
> org.apache.flink.examples.java.clustering.util.KMeansDataGenerator
> LINE 126
> Cheers
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)