Github user mgaido91 commented on the issue:
https://github.com/apache/spark/pull/20629
@holdenk I am not sure whether cluster centers should be required for this
metric. On one hand, since the `ClusteringEvaluator` should be a general
interface for all clustering models and some of them don't provide cluster
centers, it may be a good idea to compute them when necessary. On the other
hand, does this metric make sense for any model other than KMeans? And
computing the centers from the test dataset would lead to different results
from the old API we are replacing. So I am not sure it is the right thing to do.
Honestly, the longer this goes on, the more I feel that we don't really
need to move that metric here. We can just deprecate it, pointing out that
better metrics for evaluating a clustering are available in the
`ClusteringEvaluator` (namely the silhouette). In this way people can move
away from using this metric.
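To make the suggestion concrete, this is roughly what the silhouette measures: for each point, the mean distance to its own cluster (`a`) against the mean distance to the nearest other cluster (`b`). A minimal stdlib-only sketch follows; the function name and signature are illustrative, not Spark's `ClusteringEvaluator` API, which computes this at scale over a DataFrame.

```python
from math import dist  # Python 3.8+

def silhouette(points, labels):
    """Mean silhouette coefficient over all points.

    points: list of coordinate tuples; labels: cluster id per point.
    This is a toy reference implementation, not Spark's distributed one.
    """
    clusters = {}
    for p, c in zip(points, labels):
        clusters.setdefault(c, []).append(p)

    scores = []
    for p, c in zip(points, labels):
        own = clusters[c]
        if len(own) == 1:
            # Singleton clusters score 0 by convention.
            scores.append(0.0)
            continue
        # a: mean distance to the other points in the same cluster
        # (distance to self is 0, so divide the full sum by len - 1).
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the nearest other cluster.
        b = min(sum(dist(p, q) for q in other) / len(other)
                for k, other in clusters.items() if k != c)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)
```

Values close to 1 mean tight, well-separated clusters; values near 0 or below suggest overlapping clusters, which is why it is a more informative evaluation metric than a raw cost.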
Moreover, sklearn - which is one of the most widespread tools - doesn't
offer the ability to compute such a cost
(http://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation).
The only thing sklearn offers is what it calls `inertia`
(https://github.com/scikit-learn/scikit-learn/blob/a24c8b46/sklearn/cluster/k_means_.py#L265),
i.e. the cost computed on the training set.
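For reference, sklearn's inertia is just the sum of squared distances from each training point to its nearest cluster center (sklearn exposes it as the `inertia_` attribute after `fit()`). A stdlib sketch, with an illustrative function name of my own:

```python
def inertia(points, centers):
    """Sum of squared Euclidean distances from each point to its
    nearest center - the quantity sklearn stores as `inertia_`.

    points, centers: iterables of coordinate tuples.
    """
    return sum(
        min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
        for p in points
    )
```

This is exactly the "cost on the training set" that a `KMeansSummary` attribute could expose, computed against the centers the model was trained with rather than centers recomputed from a test dataset.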
So, I think the best option would be to follow what sklearn does:
1 - Introduce in the `KMeansSummary` (or `KMeansModel` if you prefer)
a cost attribute computed on the training set
2 - Deprecate this method, redirecting users to `ClusteringEvaluator` for
better metrics and/or to the newly introduced cost attribute
What do you think?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]