[GitHub] [flink-ml] lindong28 commented on a diff in pull request #110: [FLINK-27096] Optimize KMeans performance

GitBox Thu, 16 Jun 2022 07:51:08 -0700


lindong28 commented on code in PR #110:
URL: https://github.com/apache/flink-ml/pull/110#discussion_r899170104



##########
flink-ml-lib/src/main/java/org/apache/flink/ml/clustering/kmeans/KMeans.java:
##########
@@ -307,24 +266,41 @@ public void processElement2(StreamRecord<DenseVector[]> streamRecord) throws Exc
 
         @Override
         public void onEpochWatermarkIncremented(
-                int epochWatermark, Context context, Collector<Tuple2<Integer, DenseVector>> out)
+                int epochWatermark,
+                Context context,
+                Collector<Tuple2<Integer[], DenseVector[]>> out)
                 throws Exception {
             DenseVector[] centroidValues =
                     Objects.requireNonNull(
                             OperatorStateUtils.getUniqueElement(centroids, "centroids")
                                     .orElse(null));
+
+            DenseVector[] newCentroids = new DenseVector[centroidValues.length];
+            int[] counts = new int[centroidValues.length];
+            for (int i = 0; i < centroidValues.length; i++) {
+                newCentroids[i] = new DenseVector(centroidValues[i].size());
+            }
+
             for (DenseVector point : points.get()) {
                 int closestCentroidId =
                         findClosestCentroidId(centroidValues, point, distanceMeasure);
-                output.collect(new StreamRecord<>(Tuple2.of(closestCentroidId, point)));
+
+                BLAS.axpy(1.0, point, newCentroids[closestCentroidId]);
+                counts[closestCentroidId]++;
             }
 
+            output.collect(
+                    new StreamRecord<>(
+                            Tuple2.of(
+                                    Arrays.stream(counts).boxed().toArray(Integer[]::new),

Review Comment:
   `Arrays.stream(...)` is currently much slower than a plain loop in most
   cases [1]. If the loop is not too complex to write, how about we use it
   here?
   
   [1] https://stackoverflow.com/questions/27925954/is-arrays-streamarray-name-sum-slower-than-iterative-approach
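
A minimal sketch of what the iterative alternative could look like, assuming the goal is only to box the `int[] counts` array into an `Integer[]` for the `Tuple2`. The class name `BoxCounts` and the method `boxCounts` are hypothetical and do not appear in the PR; this is an illustration, not the change that was eventually merged.

```java
// Hypothetical helper, not part of the PR: boxes an int[] with a plain loop
// instead of Arrays.stream(counts).boxed().toArray(Integer[]::new).
final class BoxCounts {
    private BoxCounts() {}

    /** Copies each primitive count into a new Integer[] of the same length. */
    static Integer[] boxCounts(int[] counts) {
        Integer[] boxed = new Integer[counts.length];
        for (int i = 0; i < counts.length; i++) {
            // Autoboxing; equivalent to Integer.valueOf(counts[i]).
            boxed[i] = counts[i];
        }
        return boxed;
    }
}
```

In the hunk above, the result of such a loop would take the place of the `Arrays.stream(counts).boxed().toArray(Integer[]::new)` argument passed to `Tuple2.of(...)`.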



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
