Forgot to mention that true labels are *not* present. That is why I chose
the *Silhouette* coefficient.
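
One workaround I am considering is to estimate the silhouette on a random
subsample, so the full 262271 x 262271 distance matrix is never built. A
rough, untested sketch (it assumes my scikit-learn's silhouette_score
accepts the sample_size and random_state arguments, and that data is my
(262271, 1) array):

    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    for num_clusters in [2, 3, 4]:
        # Cluster the full series; k-means++ is the default init for KMeans.
        labels = KMeans(n_clusters=num_clusters).fit(data).labels_
        # Score only a random subsample, so a 10000 x 10000 distance
        # matrix is allocated instead of 262271 x 262271.
        # (sample_size/random_state are assumptions: check that this
        # version of silhouette_score supports them.)
        score = silhouette_score(data, labels, sample_size=10000,
                                 random_state=0)
        print num_clusters, score

A 10000 x 10000 float64 distance matrix is only about 800 MB, which should
fit in 8 GB of RAM. Does this sound reasonable?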
On Sun, Dec 8, 2013 at 5:02 PM, nipun batra <nip...@iiitd.ac.in> wrote:
> Hi,
>
> I am using *k-means++* to cluster my data series. From my domain expertise,
> I know that the number of clusters varies between 2 and 4. To find the
> *optimum* number of clusters, I was doing the following:
>
> from sklearn.cluster import KMeans
> from sklearn.metrics import silhouette_score
> silhouette = {}
> for num_clusters in [2, 3, 4]:
>     labels = KMeans(n_clusters=num_clusters).fit(data).labels_
>     silhouette[num_clusters] = silhouette_score(data, labels)
>
> Whichever *num_clusters* gives the best silhouette score would be taken as
> the *optimum* number of clusters.
> Problem
>
> I end up with a *memory* error. The complete stack trace follows:
>
>
> /usr/local/lib/python2.7/dist-packages/sklearn/metrics/cluster/unsupervised.pyc in silhouette_samples(X, labels, metric, **kwds)
>     135
>     136     """
> --> 137     distances = pairwise_distances(X, metric=metric, **kwds)
>     138     n = labels.shape[0]
>     139     A = np.array([_intra_cluster_distance(distances[i], labels, i)
>
> /usr/local/lib/python2.7/dist-packages/sklearn/metrics/pairwise.pyc in pairwise_distances(X, Y, metric, n_jobs, **kwds)
>     485         func = pairwise_distance_functions[metric]
>     486         if n_jobs == 1:
> --> 487             return func(X, Y, **kwds)
>     488         else:
>     489             return _parallel_pairwise(X, Y, func, n_jobs, **kwds)
>
> /usr/local/lib/python2.7/dist-packages/sklearn/metrics/pairwise.pyc in euclidean_distances(X, Y, Y_norm_squared, squared)
>     172     # TODO: a faster Cython implementation would do the clipping of negative
>     173     # values in a single pass over the output matrix.
> --> 174     distances = safe_sparse_dot(X, Y.T, dense_output=True)
>     175     distances *= -2
>     176     distances += XX
>
> /usr/local/lib/python2.7/dist-packages/sklearn/utils/extmath.pyc in safe_sparse_dot(a, b, dense_output)
>      76         return ret
>      77     else:
> ---> 78         return np.dot(a, b)
>      79
>      80
>
> MemoryError:
>
> As far as I understand, this is due to the pairwise distance computation.
> I guess I ran out of memory with DBSCAN for the same reason.
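> If my arithmetic is right, the dense float64 distance matrix alone would
> need about 262271 * 262271 * 8 bytes ≈ 512 GB, far more than my 8 GB of RAM.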
>
> My data is one-dimensional, with shape (262271, 1). I am using scikit-learn
> version 0.14.1.
> My system configuration is as follows:
>
> RAM: 8 GB
> Processor: i7
> OS: Ubuntu 64 bit
> Questions
>
>    1. Is there a better metric, cluster-validity index, or scoring method
>    for finding the optimum number of states in this case? If so, will it
>    also run into memory problems? Or is there a workaround with the
>    Silhouette coefficient?
>    2. If one were to use DBSCAN on such datasets, is there some way to
>    avoid the memory issues?
>
>