[GitHub] [flink-ml] zhipeng93 commented on a change in pull request #70: [FLINK-26313] Add Transformer and Estimator of OnlineKMeans

GitBox Sun, 27 Mar 2022 21:00:44 -0700


zhipeng93 commented on a change in pull request #70:
URL: https://github.com/apache/flink-ml/pull/70#discussion_r836037690




##########
File path: flink-ml-lib/src/test/java/org/apache/flink/ml/clustering/KMeansTest.java
##########
@@ -177,11 +177,20 @@ public void testFewerDistinctPointsThanCluster() {
         KMeans kmeans = new KMeans().setK(2);
         KMeansModel model = kmeans.fit(input);
         Table output = model.transform(input)[0];
-        List<Set<DenseVector>> expectedGroups =
-                Collections.singletonList(Collections.singleton(Vectors.dense(0.0, 0.1)));
-        List<Set<DenseVector>> actualGroups =
-                executeAndCollect(output, kmeans.getFeaturesCol(), kmeans.getPredictionCol());
-        assertTrue(CollectionUtils.isEqualCollection(expectedGroups, actualGroups));
+
+        try {

Review comment:
I agree with the definition `The max number of clusters to create...`.

If there are fewer distinct points than clusters, I would suggest not creating `k` centers by duplicating data points, for the following two reasons:
- Existing libraries such as Spark ML and Alink do not do this.
- There is no known use case for padding the result to `k` centers when some of those centers would be identical.
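
To make the suggestion concrete, here is a minimal sketch of picking initial centers without duplication. It uses plain `double[]` points instead of the Flink ML `DenseVector` API, and the class and method names (`DistinctCentroids`, `selectCentroids`) are hypothetical and exist only for illustration; it is not the code in this PR. It treats `k` as an upper bound, so the result holds at most `k` centers and fewer when the input has fewer distinct points.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DistinctCentroids {

    /**
     * Hypothetical helper: returns up to k distinct points, in input order.
     * If the input has fewer than k distinct points, fewer centers are returned
     * and no point is duplicated to pad the result.
     */
    static List<double[]> selectCentroids(List<double[]> points, int k) {
        // Keys of the points already chosen; arrays are boxed into lists so that
        // equality is by value (Double.equals) and not by array identity.
        Set<List<Double>> seen = new HashSet<>();
        List<double[]> centroids = new ArrayList<>();
        for (double[] point : points) {
            if (centroids.size() == k) {
                break;
            }
            List<Double> key = new ArrayList<>(point.length);
            for (double value : point) {
                key.add(value);
            }
            if (seen.add(key)) {
                centroids.add(point.clone());
            }
        }
        return centroids;
    }

    public static void main(String[] args) {
        // Same shape as testFewerDistinctPointsThanCluster: one distinct point, k = 2.
        List<double[]> input =
                Arrays.asList(
                        new double[] {0.0, 0.1},
                        new double[] {0.0, 0.1},
                        new double[] {0.0, 0.1});
        List<double[]> centroids = selectCentroids(input, 2);
        System.out.println(centroids.size()); // prints 1
    }
}
```

With this behavior the test can keep asserting a single group containing `(0.0, 0.1)` rather than expecting an exception.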




