Hi,
I am using *k-means++* to cluster my data series. From my domain expertise, I
know that the number of clusters varies between 2 and 4. To find this
*optimum* number of clusters, I was doing the following (pseudocode):
    for num_cluster in [2, 3, 4]:
        labels = cluster_using_kmeans(num_cluster, data)
        silhouette_coefficient[num_cluster] = silhouette(data, labels)
Whichever *num_cluster* gives the best silhouette score is taken as the
*optimum* number of clusters.
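Concretely, the loop looks roughly like this (a minimal sketch; *data* is
assumed to be my (n_samples, 1) NumPy array):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    scores = {}
    for num_cluster in [2, 3, 4]:
        # k-means++ is the default initialization of KMeans
        km = KMeans(n_clusters=num_cluster, init='k-means++').fit(data)
        # silhouette_score builds the full pairwise distance matrix by default
        scores[num_cluster] = silhouette_score(data, km.labels_,
                                               metric='euclidean')
    best_num_cluster = max(scores, key=scores.get)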
<#>Problem
I end up with a *memory* error.
Following is the complete stack trace.
    /usr/local/lib/python2.7/dist-packages/sklearn/metrics/cluster/unsupervised.pyc in silhouette_samples(X, labels, metric, **kwds)
        135
        136     """
    --> 137     distances = pairwise_distances(X, metric=metric, **kwds)
        138     n = labels.shape[0]
        139     A = np.array([_intra_cluster_distance(distances[i], labels, i)

    /usr/local/lib/python2.7/dist-packages/sklearn/metrics/pairwise.pyc in pairwise_distances(X, Y, metric, n_jobs, **kwds)
        485         func = pairwise_distance_functions[metric]
        486         if n_jobs == 1:
    --> 487             return func(X, Y, **kwds)
        488         else:
        489             return _parallel_pairwise(X, Y, func, n_jobs, **kwds)

    /usr/local/lib/python2.7/dist-packages/sklearn/metrics/pairwise.pyc in euclidean_distances(X, Y, Y_norm_squared, squared)
        172     # TODO: a faster Cython implementation would do the clipping of negative
        173     # values in a single pass over the output matrix.
    --> 174     distances = safe_sparse_dot(X, Y.T, dense_output=True)
        175     distances *= -2
        176     distances += XX

    /usr/local/lib/python2.7/dist-packages/sklearn/utils/extmath.pyc in safe_sparse_dot(a, b, dense_output)
         76         return ret
         77     else:
    ---> 78         return np.dot(a, b)
         79
         80

    MemoryError:
As far as I understand, this is due to the pairwise distance computation: a
262271 x 262271 float64 distance matrix takes roughly 512 GiB, far more than
my RAM. I suspect I ran out of memory with DBSCAN for the same reason.
My data is one-dimensional, with shape (262271, 1). I am using scikit-learn
version 0.14.1.
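A quick back-of-the-envelope check of that estimate (assuming float64
entries):

    n = 262271
    print(n * n * 8 / 2.0 ** 30)  # ~512.5 GiB for the full distance matrix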
My system configuration is the following:
RAM: 8 GB
Processor: i7
OS: Ubuntu, 64-bit
<#>Questions
1. Is there a better metric or cluster-validity score for finding the
optimum number of clusters in this case? If so, will it run into the same
memory problems? Or is there a workaround for the silhouette coefficient
(one idea is sketched below)?
2. If one were to use DBSCAN on datasets of this size, is there some way to
avoid the memory issues?
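For question 1, the workaround I had in mind is to estimate the silhouette
on a random subsample, so the pairwise matrix stays small. A sketch under
that assumption (the helper name is mine; I believe newer scikit-learn
versions expose a *sample_size* argument on *silhouette_score* for exactly
this, but I am not sure it is available in 0.14.1):

    import numpy as np
    from sklearn.metrics import silhouette_score

    def subsampled_silhouette(data, labels, sample_size=10000, seed=0):
        # Score a random subset so the distance matrix is
        # sample_size x sample_size instead of n x n.
        rng = np.random.RandomState(seed)
        idx = rng.permutation(data.shape[0])[:sample_size]
        return silhouette_score(data[idx], labels[idx], metric='euclidean')

With sample_size = 10000 the distance matrix is about 0.75 GiB, which fits
in 8 GB of RAM.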