The gap statistic is a popular method for choosing the "optimum" k in
k-means. I've written an implementation, which doesn't appear to exist in
scikit-learn yet; it's been useful for my client's data, and I'd like to
contribute it. I'm attaching my script so you can see what I'm talking
about. I know I still have to document it and parameterize the function.
The gap statistic algorithm works by generating a number B of random
reference datasets and clustering each of them with 1, 2, ..., K
clusters. The actual data is clustered with the same numbers of
clusters. The dispersion statistics for the reference datasets and for the
actual data are then compared, and the optimum value of k is chosen from
that comparison.
I want to make K (the maximum number of clusters) and B (the number of
reference datasets to generate) parameters of the function. The user
should also be able to choose between KMeans and MiniBatchKMeans for the
clustering, and then they also need to be able to set the parameters of
the chosen clustering method. Unfortunately, the parameters (and their
default values) aren't quite the same for KMeans and MiniBatchKMeans.
To avoid all these complications, I would have the user pass in a
clustering object, so the signature would be

    gap(km, K=9, B=10)

or something like that, where km is either a KMeans or a MiniBatchKMeans
object.
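As a sketch of how the passed-in estimator could be handled (the helper name
is hypothetical, and I'm assuming sklearn's clone/set_params machinery is
acceptable here), gap() could clone the user's object and override only
n_clusters:

```python
import numpy as np
from sklearn.base import clone
from sklearn.cluster import KMeans

def fit_per_k(km, X, K):
    '''Fit a clone of the user's estimator for each k = 1..K and record
    log(inertia_). km can be a KMeans or MiniBatchKMeans that the caller
    has already configured; only n_clusters is overridden here.
    (Hypothetical helper, not part of the attached script.)'''
    logW = np.zeros(K)
    for k in range(1, K + 1):
        est = clone(km).set_params(n_clusters=k)
        est.fit(X)
        logW[k - 1] = np.log(est.inertia_)
    return logW

# e.g. fit_per_k(KMeans(n_init=10, random_state=0), X, K=9)
```

That way the two estimators' differing defaults never need to be reconciled
inside gap() itself.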
The function returns (k, gap, s) where k is the optimal k, gap is the gap
statistic, and s is the standard deviation of the dataset dispersion.
Sometimes the method fails, and in that case, the k returned is None, but
gap and s are still returned.
The k returned is the smallest k such that gap[k] >= gap[k+1] - s[k+1].
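To illustrate the rule with some made-up numbers (the Gap and s values below
are hypothetical, indexed so that position k corresponds to k+1 clusters):

```python
import numpy as np

# Hypothetical gap statistics and standard deviations for k+1 = 1..4 clusters
Gap = np.array([0.20, 0.50, 0.45, 0.40])
s = np.array([0.05, 0.05, 0.05, 0.05])

# Smallest k with Gap[k] >= Gap[k+1] - s[k+1]; None if no k qualifies
candidates = [k for k in range(len(Gap) - 1) if Gap[k] >= Gap[k + 1] - s[k + 1]]
khat = 1 + candidates[0] if candidates else None
print(khat)  # 2, since Gap[1] = 0.50 >= Gap[2] - s[2] = 0.40
```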
Right now, the function computes all the statistics first and then checks
for an acceptable candidate. I should probably reorganize the code as
generate-and-test instead, quitting as soon as an acceptable k is found.
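The control flow of that reorganization could look like this (all names here
are hypothetical, and the actual clustering is abstracted into a
caller-supplied dispersion function):

```python
import numpy as np

def gap_incremental(dispersion, K=9, B=10):
    '''Generate-and-test: compute statistics for k = 1, 2, ... and stop at
    the first k satisfying Gap[k] >= Gap[k+1] - s[k+1].

    dispersion(k) -> (log W for the data, array of log W* for B references).
    Hypothetical sketch; only the early-exit control flow is the point.'''
    prev_gap = None
    for k in range(1, K + 1):
        logW, logWstar = dispersion(k)
        gap_k = logWstar.mean() - logW
        s_k = np.std(logWstar) * np.sqrt(1 + 1 / B)
        # The previous k is acceptable if its gap beats this one minus s
        if prev_gap is not None and prev_gap >= gap_k - s_k:
            return k - 1
        prev_gap = gap_k
    return None  # no acceptable k found up to K
```

This stops after the first acceptable k, at the cost of sometimes computing
one more level than strictly needed for the comparison.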
Do you have any interest in this?
from __future__ import division

import numpy as np
from sklearn.cluster import MiniBatchKMeans


def gap(X):
    '''
    See http://web.stanford.edu/~hastie/Papers/gap.pdf
    X has shape (n_samples, n_features)
    '''
    # Get the bounding box of the population
    mins, maxes = np.min(X, axis=0), np.max(X, axis=0)
    B = 10  # number of reference datasets (from uniform distribution)
    K = 9   # maximum number of clusters
    W = np.zeros(K)           # log dispersion of X with k+1 clusters
    Wstar = np.zeros([B, K])  # log dispersion of reference dataset b with k+1 clusters
    mbk = MiniBatchKMeans()
    for k in range(K):
        mbk.n_clusters = k + 1
        mbk.fit(X)
        W[k] = np.log(mbk.inertia_)
    for b in range(B):
        # Reference data drawn uniformly over the bounding box of X
        dataset = np.random.rand(*X.shape) * (maxes - mins) + mins
        for k in range(K):
            mbk.n_clusters = k + 1
            mbk.fit(dataset)
            Wstar[b, k] = np.log(mbk.inertia_)
    Gap = Wstar.mean(axis=0) - W
    # s_k = sd_k * sqrt(1 + 1/B), per the paper (multiply, not divide)
    s = np.std(Wstar, axis=0) * np.sqrt(1 + 1 / B)
    try:
        khat = 1 + min(k for k in range(K - 1) if Gap[k] >= Gap[k + 1] - s[k + 1])
    except ValueError:
        khat = None
    # Pad with a leading zero so that Gap[k] and s[k] correspond to k clusters
    z = np.array([0])
    Gap = np.hstack((z, Gap))
    s = np.hstack((z, s))
    return khat, Gap, s
if __name__ == '__main__':
    from sklearn import datasets

    iris = datasets.load_iris()
    X = iris.data[:, :2]  # we only take the first two features
    k, g, s = gap(X)
    print(k)
    print(g)
    print(s)
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general