lindong28 commented on code in PR #110:
URL: https://github.com/apache/flink-ml/pull/110#discussion_r899170104
##########
flink-ml-lib/src/main/java/org/apache/flink/ml/clustering/kmeans/KMeans.java:
##########
@@ -307,24 +266,41 @@ public void processElement2(StreamRecord<DenseVector[]> streamRecord) throws Exc
@Override
public void onEpochWatermarkIncremented(
- int epochWatermark, Context context, Collector<Tuple2<Integer, DenseVector>> out)
+ int epochWatermark,
+ Context context,
+ Collector<Tuple2<Integer[], DenseVector[]>> out)
throws Exception {
DenseVector[] centroidValues =
Objects.requireNonNull(
OperatorStateUtils.getUniqueElement(centroids, "centroids")
.orElse(null));
+
+ DenseVector[] newCentroids = new DenseVector[centroidValues.length];
+ int[] counts = new int[centroidValues.length];
+ for (int i = 0; i < centroidValues.length; i++) {
+ newCentroids[i] = new DenseVector(centroidValues[i].size());
+ }
+
for (DenseVector point : points.get()) {
int closestCentroidId =
findClosestCentroidId(centroidValues, point, distanceMeasure);
- output.collect(new StreamRecord<>(Tuple2.of(closestCentroidId, point)));
+
+ BLAS.axpy(1.0, point, newCentroids[closestCentroidId]);
+ counts[closestCentroidId]++;
}
+ output.collect(
+ new StreamRecord<>(
+ Tuple2.of(
+ Arrays.stream(counts).boxed().toArray(Integer[]::new),
Review Comment:
`Arrays.stream(...)` is currently much slower than an equivalent loop in
general [1]. If the iterative approach is not too complex to write, how about
we use it here?
[1] https://stackoverflow.com/questions/27925954/is-arrays-streamarray-name-sum-slower-than-iterative-approach
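
For illustration, a minimal sketch of what the iterative version might look
like, assuming the only goal is to box `counts` into an `Integer[]`. The class
`BoxCounts` and the helper `boxCounts` are hypothetical names used for this
example and are not part of the PR:

```java
import java.util.Arrays;

final class BoxCounts {
    // Hypothetical helper (not in the PR): boxes an int[] into an Integer[]
    // with a plain loop instead of Arrays.stream(...).boxed().toArray(...).
    static Integer[] boxCounts(int[] counts) {
        Integer[] boxed = new Integer[counts.length];
        for (int i = 0; i < counts.length; i++) {
            boxed[i] = counts[i]; // autoboxing, equivalent to Integer.valueOf(counts[i])
        }
        return boxed;
    }

    public static void main(String[] args) {
        int[] counts = {3, 0, 5};
        System.out.println(Arrays.toString(boxCounts(counts))); // prints [3, 0, 5]
    }
}
```

The result of such a loop could then be passed to `Tuple2.of(...)` in place of
the `Arrays.stream(counts).boxed().toArray(Integer[]::new)` expression.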