Hi,

I am using *k-means++* to cluster my data series. From my domain expertise, I
know that the number of clusters varies between 2 and 4. To find this
*optimum* number of clusters, I was doing roughly the following (sketched in Python):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

scores = {}
for num_cluster in [2, 3, 4]:
    labels = KMeans(n_clusters=num_cluster, init='k-means++').fit(data).labels_
    scores[num_cluster] = silhouette_score(data, labels)

Whichever *num_cluster* gives the best silhouette score would be the
*optimum* number of clusters.
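Picking the winner from the *scores* dict in the sketch above would then just be an argmax:

best_k = max(scores, key=scores.get)
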
Problem

I end up with a *MemoryError*.
Here is the relevant part of the stack trace:


/usr/local/lib/python2.7/dist-packages/sklearn/metrics/cluster/unsupervised.pyc in silhouette_samples(X, labels, metric, **kwds)
    135
    136     """
--> 137     distances = pairwise_distances(X, metric=metric, **kwds)
    138     n = labels.shape[0]
    139     A = np.array([_intra_cluster_distance(distances[i], labels, i)

/usr/local/lib/python2.7/dist-packages/sklearn/metrics/pairwise.pyc in pairwise_distances(X, Y, metric, n_jobs, **kwds)
    485         func = pairwise_distance_functions[metric]
    486         if n_jobs == 1:
--> 487             return func(X, Y, **kwds)
    488         else:
    489             return _parallel_pairwise(X, Y, func, n_jobs, **kwds)

/usr/local/lib/python2.7/dist-packages/sklearn/metrics/pairwise.pyc in euclidean_distances(X, Y, Y_norm_squared, squared)
    172     # TODO: a faster Cython implementation would do the clipping of negative
    173     # values in a single pass over the output matrix.
--> 174     distances = safe_sparse_dot(X, Y.T, dense_output=True)
    175     distances *= -2
    176     distances += XX

/usr/local/lib/python2.7/dist-packages/sklearn/utils/extmath.pyc in safe_sparse_dot(a, b, dense_output)
     76         return ret
     77     else:
---> 78         return np.dot(a, b)
     79
     80

MemoryError:

As far as I understand, this is due to the pairwise distance computation. I
guess I ran out of memory with DBSCAN for the same reason.
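
Just to sanity-check my understanding, here is my own back-of-envelope estimate, assuming pairwise_distances materialises a dense float64 matrix of shape (n_samples, n_samples):

n_samples = 262271
bytes_needed = n_samples ** 2 * 8   # dense float64 n x n distance matrix
print(bytes_needed / 1024.0 ** 3)   # ~512 GiB, versus 8 GB of RAM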

My data is one-dimensional, with shape (262271, 1). I am using scikit-learn
version 0.14.1.
My system configuration is as follows:

RAM: 8 GB
Processor: i7
OS: Ubuntu 64 bit
Questions

   1. Is there a better metric / cluster-validity score for finding the
   optimum number of clusters in this case? If yes, will it run into the same
   memory problems? Or is there a workaround with the silhouette coefficient
   (one possible workaround is sketched below)?
   2. If one were to use DBSCAN on such a dataset, is there some way to
   avoid the memory issues?
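
One workaround I am considering for question 1 (just a sketch on my side, not something from the scikit-learn docs): compute the silhouette score on a random subsample of the 262271 points, so the pairwise distance matrix stays small:

import numpy as np
from sklearn.metrics import silhouette_score

def subsampled_silhouette(data, labels, sample_size=10000, seed=0):
    # Score a random subset so the distance matrix is only
    # sample_size x sample_size instead of 262271 x 262271.
    rng = np.random.RandomState(seed)
    idx = rng.permutation(data.shape[0])[:sample_size]
    return silhouette_score(data[idx], labels[idx])

With sample_size=10000 the distance matrix is about 0.8 GB instead of ~512 GiB. (Newer scikit-learn versions also expose a sample_size argument on silhouette_score itself, but I have not checked whether 0.14.1 already has it.)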