Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20518#discussion_r166336855
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala ---
    @@ -745,4 +763,27 @@ private[spark] class CosineDistanceMeasure extends DistanceMeasure {
       override def distance(v1: VectorWithNorm, v2: VectorWithNorm): Double = {
         1 - dot(v1.vector, v2.vector) / v1.norm / v2.norm
       }
    +
    +  /**
    +   * Updates the value of `sum` by adding the `point` vector.
    +   * @param point a `VectorWithNorm` to be added to `sum` of a cluster
    +   * @param sum the `sum` for a cluster to be updated
    +   */
    +  override def updateClusterSum(point: VectorWithNorm, sum: Vector): Unit = {
    +    axpy(1.0 / point.norm, point.vector, sum)
    +  }
    +
    +  /**
    +   * Returns a centroid for a cluster given its `sum` vector and its `count` of points.
    +   *
    +   * @param sum   the `sum` for a cluster
    +   * @param count the number of points in the cluster
    +   * @return the centroid of the cluster
    +   */
    +  override def centroid(sum: Vector, count: Long): VectorWithNorm = {
    +    scal(1.0 / count, sum)
    +    val norm = Vectors.norm(sum, 2)
    --- End diff --
    
    Rather than scale `sum` twice, can you just compute its norm and then scale by 1 / (norm * count * count)?
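    
    As a side note, a minimal standalone sketch of why a single scaling pass can stand in for two `scal` calls, assuming the rest of `centroid` normalizes the averaged sum to unit length (plain Scala with no Spark dependency; `CosineCentroidSketch`, the `sum` values, and `count` below are made up for illustration): normalizing `sum / count` yields the same unit vector as normalizing `sum` directly, since norm(sum / count) = norm(sum) / count, and for cosine distance only the direction of the centroid matters.
    
    object CosineCentroidSketch {
      // Euclidean (L2) norm of a plain Array[Double].
      def norm(v: Array[Double]): Double = math.sqrt(v.map(x => x * x).sum)
    
      def main(args: Array[String]): Unit = {
        val sum = Array(3.0, 4.0, 12.0)  // hypothetical cluster sum of unit-normalized points
        val count = 5L                   // hypothetical number of points in the cluster
    
        // Two scalings: average by count, then normalize the averaged vector.
        val averaged = sum.map(_ / count)
        val twoScal  = averaged.map(_ / norm(averaged))
    
        // One scaling: normalize the raw sum directly.
        val oneScal = sum.map(_ / norm(sum))
    
        // Both paths produce the same unit-length centroid direction.
        assert(twoScal.zip(oneScal).forall { case (a, b) => math.abs(a - b) < 1e-12 })
        println(oneScal.mkString("[", ", ", "]"))
      }
    }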


---
