[ 
https://issues.apache.org/jira/browse/FLINK-3245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15104984#comment-15104984
 ] 

Till Rohrmann commented on FLINK-3245:
--------------------------------------

I agree with your description of what the k-means algorithm does, [~fhueske], 
but I think Kay's question arises from a strange convention on our side.

The k-means data generator generates a sample of points drawn from multiple 
Gaussian distributions. The means of the Gaussian distributions are sampled 
from a uniform distribution over the specified range. Additionally, the program 
generates another file called {{centers}}. This file does not contain, as one 
would assume according to the file's name, the original centers of the Gaussian 
distributions, but newly sampled points from a uniform distribution. These 
points are then used by the k-means implementation to initialize the 
computation.
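
As a rough sketch of the behaviour described above (this is not the actual 
{{KMeansDataGenerator}} code; the class name, method names, parameters and the 
cyclic assignment of points to means are assumptions made for illustration):

```java
import java.util.Arrays;
import java.util.Random;

/** Hypothetical sketch of the described convention, not the real generator. */
final class GeneratorSketch {

    /** Samples k points uniformly from [min, max) in every dimension. */
    static double[][] uniformPoints(Random rnd, int k, int dim, double min, double max) {
        double[][] pts = new double[k][dim];
        for (double[] p : pts) {
            for (int d = 0; d < dim; d++) {
                p[d] = min + rnd.nextDouble() * (max - min);
            }
        }
        return pts;
    }

    /** Draws n points, each from a Gaussian around one of the given means. */
    static double[][] gaussianPoints(Random rnd, double[][] means, int n, double stddev) {
        int dim = means[0].length;
        double[][] pts = new double[n][dim];
        for (int i = 0; i < n; i++) {
            // Assumption: means are used in round-robin order.
            double[] mean = means[i % means.length];
            for (int d = 0; d < dim; d++) {
                pts[i][d] = mean[d] + stddev * rnd.nextGaussian();
            }
        }
        return pts;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42L);
        // Means of the Gaussians, sampled uniformly over the specified range.
        double[][] means = uniformPoints(rnd, 3, 2, -100.0, 100.0);
        double[][] points = gaussianPoints(rnd, means, 9, 1.0);
        // The confusing part: the "centers" output is a second, independent
        // uniform sample and is unrelated to the means used for the points.
        double[][] centersFile = uniformPoints(rnd, 3, 2, -100.0, 100.0);

        System.out.println("true means:   " + Arrays.deepToString(means));
        System.out.println("centers file: " + Arrays.deepToString(centersFile));
        System.out.println("first point:  " + Arrays.toString(points[0]));
    }
}
```

Running the sketch prints two unrelated sets of coordinates for the true means 
and the centers file, which is exactly the mismatch Kay noticed.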

I think that this does not make much sense and is confusing. Instead, the 
initial centroids should be generated from the input data set (the sampled 
points) and not from a dedicated centers file which has nothing to do with the 
actual distribution. Furthermore, I think it would make sense to store the 
original centers in the {{centers}} file so that one can compare the computed 
centers to them.
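
A minimal sketch of what I have in mind, assuming the points are available as 
plain arrays (the class and method names below are hypothetical and not part 
of the existing example code; picking points via a shuffle is just one 
possible way to sample from the input):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical sketch of the proposal, not existing Flink code. */
final class InitFromDataSketch {

    /** Picks k input points (at distinct indices) as initial centroids. */
    static double[][] sampleInitialCentroids(double[][] points, int k, Random rnd) {
        if (k > points.length) {
            throw new IllegalArgumentException("k must not exceed the number of points");
        }
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < points.length; i++) {
            indices.add(i);
        }
        Collections.shuffle(indices, rnd);
        double[][] centroids = new double[k][];
        for (int i = 0; i < k; i++) {
            centroids[i] = points[indices.get(i)].clone();
        }
        return centroids;
    }

    /** Euclidean distance from a computed center to the closest true center. */
    static double distanceToNearest(double[] computed, double[][] trueCenters) {
        double best = Double.POSITIVE_INFINITY;
        for (double[] c : trueCenters) {
            double sum = 0.0;
            for (int d = 0; d < c.length; d++) {
                double diff = computed[d] - c[d];
                sum += diff * diff;
            }
            best = Math.min(best, Math.sqrt(sum));
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] points = {{0.0, 0.0}, {0.5, 0.2}, {10.0, 10.0}, {9.8, 10.1}};
        // True centers as they would be stored in the centers file.
        double[][] trueCenters = {{0.0, 0.0}, {10.0, 10.0}};

        double[][] initial = sampleInitialCentroids(points, 2, new Random(7L));
        System.out.println("initial centroids: " + Arrays.deepToString(initial));
        for (double[] c : initial) {
            System.out.println("distance to nearest true center: "
                    + distanceToNearest(c, trueCenters));
        }
    }
}
```

With the true centers stored in the {{centers}} file, a check like 
{{distanceToNearest}} could be applied to the final k-means result to see how 
close the computed centers are to the ones used for generating the data.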

> KMeans Data Generator writes not the same centroids as it was used for the 
> dataset
> ----------------------------------------------------------------------------------
>
>                 Key: FLINK-3245
>                 URL: https://issues.apache.org/jira/browse/FLINK-3245
>             Project: Flink
>          Issue Type: Bug
>            Reporter: Kay
>            Priority: Trivial
>
> Hey guys.
> I am using your really nice KMeans dataset generator. I am wondering what 
> actually is the reason you write out not the same centers as the data 
> generator has used for the generated dataset.
> org.apache.flink.examples.java.clustering.util.KMeansDataGenerator
> LINE 126
> Cheers



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
