As part of evaluating cluster quality, I'd like to implement a bunch of
quality measures, especially external ones.

The one that I think would be particularly useful is the Adjusted Rand
Index [1].
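Using a contingency table built from the partitions of 2 clusterings, it
returns a similarity score whose maximum is 1 (identical partitions). The
score is 0 in expectation for random partitions and can be negative when
the agreement is worse than chance; the plain, unadjusted Rand index is
the one that ranges from 0 to 1.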
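
To make the computation concrete, here is a minimal sketch in Python of the
ARI computed from a contingency table, following the formula on the
Wikipedia page [1]. The function name and the list-of-lists table layout are
my own assumptions for illustration, not an existing API, and returning 1.0
in the degenerate case is a convention I picked.

```python
from math import comb


def adjusted_rand_index(table):
    """ARI from a contingency table given as a list of rows (hypothetical layout).

    table[i][j] = number of items in cluster i of the first partition
    and in cluster j of the second. The table may be rectangular.
    """
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    n = sum(row_sums)

    # Pairs of items placed together in both partitions.
    index = sum(comb(cell, 2) for row in table for cell in row)
    # Pairs placed together within each partition separately.
    sum_rows = sum(comb(a, 2) for a in row_sums)
    sum_cols = sum(comb(b, 2) for b in col_sums)

    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # assumption: fewer than two items counts as agreement

    expected = sum_rows * sum_cols / total_pairs
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # assumption: degenerate case, e.g. one cluster on both sides
    return (index - expected) / (max_index - expected)
```

For example, the table [[2, 0], [0, 2]] (identical partitions) gives 1.0,
while [[1, 1], [1, 1]] gives -0.5, which shows that the score can drop below
zero.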

First of all, I'd like to know what you think of using ARI as a metric.

Second, there's an implementation of ConfusionMatrix that is NxN. I'd like
to extend this class to support unlabeled partitions of different sizes and
add a method that computes the ARI.
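
As a rough illustration of what "unlabeled partitions of different sizes"
would mean for the table, here is a small Python sketch that builds a
rectangular contingency table from two cluster assignments. It is not the
existing ConfusionMatrix API; the function name and the input format (one
cluster id per item, in the same item order for both clusterings) are
assumptions of mine.

```python
def contingency_table(assignments_a, assignments_b):
    """Build a rows-by-columns count table from two cluster assignments.

    Hypothetical helper: assignments_a[k] and assignments_b[k] are the
    cluster ids given to item k by the two clusterings. The ids need not
    match between clusterings, and the cluster counts may differ.
    """
    if len(assignments_a) != len(assignments_b):
        raise ValueError("both clusterings must cover the same items")

    # Map each distinct cluster id to a row/column index, in first-seen order.
    rows = {cid: i for i, cid in enumerate(dict.fromkeys(assignments_a))}
    cols = {cid: j for j, cid in enumerate(dict.fromkeys(assignments_b))}

    table = [[0] * len(cols) for _ in rows]
    for a, b in zip(assignments_a, assignments_b):
        table[rows[a]][cols[b]] += 1
    return table
```

With 2 clusters on one side and 3 on the other, the result is a 2x3 table,
which an NxN matrix cannot represent.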

What are your thoughts?

[1] http://en.wikipedia.org/wiki/Rand_index
