Github user mgaido91 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19340#discussion_r140861096
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala ---
@@ -546,10 +574,88 @@ object KMeans {
.run(data)
}
+  private[spark] def validateInitMode(initMode: String): Boolean = {
+    initMode match {
+      case KMeans.RANDOM => true
+      case KMeans.K_MEANS_PARALLEL => true
+      case _ => false
+    }
+  }
+
+  private[spark] def validateDistanceMeasure(distanceMeasure: String): Boolean = {
+    distanceMeasure match {
+      case DistanceSuite.EUCLIDEAN => true
+      case DistanceSuite.COSINE => true
+      case _ => false
+    }
+  }
+}
+
+/**
+ * A vector with its norm for fast distance computation.
+ *
+ * @see [[org.apache.spark.mllib.clustering.KMeans#fastSquaredDistance]]
+ */
+private[clustering]
+class VectorWithNorm(val vector: Vector, val norm: Double) extends Serializable {
+
+  def this(vector: Vector) = this(vector, Vectors.norm(vector, 2.0))
+
+  def this(array: Array[Double]) = this(Vectors.dense(array))
+
+  /** Converts the vector to a dense vector. */
+  def toDense: VectorWithNorm = new VectorWithNorm(Vectors.dense(vector.toArray), norm)
+}
+
+
+private[spark] abstract class DistanceSuite extends Serializable {
+
+  /**
+   * Returns the index of the closest center to the given point, as well as the squared distance.
+   */
+  def findClosest(
--- End diff ---
Even though you are right in theory, if you look at the current
implementation for the euclidean distance, there is an optimization that,
for performance reasons, avoids computing the real distance measure. Thus,
dropping this method and introducing a more generic one would cause a
performance regression for the euclidean distance, which is something I'd
definitely like to avoid.
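
To make the concern concrete, here is a rough sketch (not the actual MLlib code; `findClosestEuclidean` is just an illustrative name) of the norm-based shortcut the current euclidean `findClosest` relies on: the precomputed norms give a cheap lower bound on the squared distance, so the exact distance is only computed for centers that cannot be ruled out. A generic `distance(a, b)` hook would bypass this shortcut:

```scala
import org.apache.spark.mllib.linalg.Vectors

// Rough sketch of the existing euclidean optimization (illustrative only,
// not the actual KMeans.findClosest implementation).
def findClosestEuclidean(
    centers: Seq[VectorWithNorm],
    point: VectorWithNorm): (Int, Double) = {
  var bestDistance = Double.PositiveInfinity
  var bestIndex = 0
  var i = 0
  centers.foreach { center =>
    // ||a - b||^2 >= (||a|| - ||b||)^2, so the precomputed norms give a cheap
    // lower bound that lets us skip most exact distance computations.
    val normDiff = center.norm - point.norm
    val lowerBoundOfSqDist = normDiff * normDiff
    if (lowerBoundOfSqDist < bestDistance) {
      val distance = Vectors.sqdist(center.vector, point.vector)
      if (distance < bestDistance) {
        bestDistance = distance
        bestIndex = i
      }
    }
    i += 1
  }
  (bestIndex, bestDistance)
}
```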