GitHub user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20396#discussion_r167590799
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -421,13 +456,220 @@ private[evaluation] object SquaredEuclideanSilhouette {
           computeSilhouetteCoefficient(bClustersStatsMap, _: Vector, _: Double, _: Double)
         }
     
    -    val silhouetteScore = dfWithSquaredNorm
    -      .select(avg(
    -        computeSilhouetteCoefficientUDF(
    -          col(featuresCol), col(predictionCol).cast(DoubleType), col("squaredNorm"))
    -      ))
    -      .collect()(0)
    -      .getDouble(0)
    +    val silhouetteScore = overallScore(dfWithSquaredNorm,
    +      computeSilhouetteCoefficientUDF(col(featuresCol), col(predictionCol).cast(DoubleType),
    +        col("squaredNorm")))
    +
    +    bClustersStatsMap.destroy()
    +
    +    silhouetteScore
    +  }
    +}
    +
    +
    +/**
    + * The algorithm which is implemented in this object, instead, is an efficient and parallel
    --- End diff --
    
    There are 2 reasons for my sentence; let me know if it is not clear:
     1. the whole algorithm (all the math steps described) assumes that we are using the cosine distance; for a different distance measure, the algorithm is not valid, even though the same approach can be used to do something similar (as is done for the squared Euclidean distance above);
     2. for some distance measures it is not possible to use this approach at all - i.e. aggregating the scores of the points - because the math doesn't allow it. For instance, you can't use it with the plain Euclidean distance, since the sqrt prevents any possible aggregation (see the sketch below).


---
