Julian Jorczik created SPARK-34664:
--------------------------------------
Summary: Provide silhouette score for each sample when using ClusteringEvaluator
Key: SPARK-34664
URL: https://issues.apache.org/jira/browse/SPARK-34664
Project: Spark
Issue Type: Improvement
Components: ML
Affects Versions: 3.1.1
Reporter: Julian Jorczik
ClusteringEvaluator already computes the average silhouette score. Having
looked at the [source
code|https://gitlab.com/mark91/SparkClusteringEvaluationMetrics/-/blob/master/src/main/scala/org/apache/spark/ml/evaluation/SquaredEuclideanSilhouetteEvaluator.scala]
of ClusteringEvaluator, I think it would be easy to expose not only the
average silhouette score but also the silhouette score of each sample, since
the per-sample scores are already computed there (lines 95-99).
The per-sample silhouette scores are useful, for instance, for generating a
silhouette plot as described in [this scikit-learn
article|https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html].
The resulting feature would be equivalent to the silhouette_samples function
implemented in scikit-learn.
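
To make the request concrete, here is a minimal sketch of what "silhouette score for each sample" means, written in plain Python rather than against the Spark API. It is not the ClusteringEvaluator implementation: the function name simply mirrors scikit-learn's silhouette_samples, the brute-force O(n^2) loop is only for illustration, and the sketch assumes plain Euclidean distance (ClusteringEvaluator defaults to squared Euclidean distance, so its values would differ).

```python
import math
from collections import defaultdict


def silhouette_samples(points, labels):
    """Return one silhouette coefficient per sample (illustrative, O(n^2)).

    Assumption: Euclidean distance between points given as numeric tuples.
    """
    # Group sample indices by their cluster label.
    clusters = defaultdict(list)
    for idx, label in enumerate(labels):
        clusters[label].append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            # Common convention: a singleton cluster scores 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the sample's own cluster.
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        # Guard against identical points, where both a and b are 0.
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


# Hypothetical toy data: two well-separated clusters on a line.
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
labels = [0, 0, 1, 1]
per_sample = silhouette_samples(points, labels)
average = sum(per_sample) / len(per_sample)
```

Averaging the per-sample values, as in the last line, yields the kind of single number the evaluator reports today; the per-sample list is what a silhouette plot needs.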