Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/20629
  
    @holdenk I am not sure I got 100% what you meant, so I'll try to answer; 
please let me know if I missed something.
    
    The problem with doing 2 passes is related to the cluster centers. The API 
of `ClusteringEvaluator` (as for any `Evaluator`) is very simple: it has a 
method which takes a `Dataset` and returns a value. So, unlike the method here 
- which is part of `KMeansModel` and can read the cluster centers from the 
model - the evaluator has no access to the cluster centers: computing them is 
easy, but it requires an extra pass over the dataset (this is the extra pass I 
mentioned).
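    To illustrate, the extra pass in question is essentially a group-by-cluster 
mean over the whole dataset. A minimal sketch in plain Python (standing in for 
the Spark aggregation; the function name and inputs are illustrative, not part 
of any Spark API):

```python
from collections import defaultdict

def compute_cluster_centers(points, predictions):
    """Recompute cluster centers from (point, predicted-cluster) pairs.

    Cheap to do, but it is one full pass over the data -- the extra pass
    an Evaluator would need, since it only receives the Dataset and has
    no handle on the fitted KMeansModel (which already holds the centers).
    """
    sums = defaultdict(list)   # cluster id -> per-dimension running sums
    counts = defaultdict(int)  # cluster id -> number of points seen
    for point, cluster in zip(points, predictions):
        if not sums[cluster]:
            sums[cluster] = [0.0] * len(point)
        for i, x in enumerate(point):
            sums[cluster][i] += x
        counts[cluster] += 1
    return {c: [v / counts[c] for v in vec] for c, vec in sums.items()}
```
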
    
    An alternative is to add a `setClusterCenters` method to 
`ClusteringEvaluator`, but I am not sure it is worth it, since the centers are 
needed only for this metric; the other metrics implemented so far (the 
Silhouette measure) have no use for them. Moreover, this metric was introduced 
explicitly as a temporary fix because we lacked any better evaluation metric, 
and it was meant to be retired once a better one was introduced (please see 
the related JIRA and PR). So I am not sure that introducing a new method 
specifically for this metric is a good idea.
    
    What do you think? Were you suggesting this second option?



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
