Forgot to mention that true labels are *not* present. That is why I chose
the *Silhouette* coefficient.
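
One workaround I am considering is to estimate the silhouette on a random
subsample, so the full 262271 x 262271 distance matrix is never built. A
rough, untested sketch (it assumes my scikit-learn's silhouette_score
accepts the sample_size and random_state arguments, and that data is my
(262271, 1) array):

    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    for num_clusters in [2, 3, 4]:
        # Cluster the full series; k-means++ is the default init for KMeans.
        labels = KMeans(n_clusters=num_clusters).fit(data).labels_
        # Score only a random subsample, so a 10000 x 10000 distance
        # matrix is allocated instead of 262271 x 262271.
        # (sample_size/random_state are assumptions: check that this
        # version of silhouette_score supports them.)
        score = silhouette_score(data, labels, sample_size=10000,
                                 random_state=0)
        print num_clusters, score

A 10000 x 10000 float64 distance matrix is only about 800 MB, which should
fit in 8 GB of RAM. Does this sound reasonable?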
On Sun, Dec 8, 2013 at 5:02 PM, nipun batra <nip...@iiitd.ac.in> wrote:
> Hi,
>
> I am using *k-means++* to cluster my data series. From my domain expertise,
> I know that the number of clusters varies between 2 and 4. To find the
> *optimum* number of clusters, I was doing the following:
>
> from sklearn.cluster import KMeans
> from sklearn.metrics import silhouette_score
> silhouette = {}
> for num_clusters in [2, 3, 4]:
>     labels = KMeans(n_clusters=num_clusters).fit(data).labels_
>     silhouette[num_clusters] = silhouette_score(data, labels)
>
> Whichever *num_clusters* gives the best silhouette score would be taken as
> the *optimum* number of clusters.
> Problem
>
> I end up with a *memory* error. The complete stack trace follows:
>
>
> /usr/local/lib/python2.7/dist-packages/sklearn/metrics/cluster/unsupervised.pyc in silhouette_samples(X, labels, metric, **kwds)
>     135
>     136     """
> --> 137     distances = pairwise_distances(X, metric=metric, **kwds)
>     138     n = labels.shape[0]
>     139     A = np.array([_intra_cluster_distance(distances[i], labels, i)
>
> /usr/local/lib/python2.7/dist-packages/sklearn/metrics/pairwise.pyc in pairwise_distances(X, Y, metric, n_jobs, **kwds)
>     485         func = pairwise_distance_functions[metric]
>     486         if n_jobs == 1:
> --> 487             return func(X, Y, **kwds)
>     488         else:
>     489             return _parallel_pairwise(X, Y, func, n_jobs, **kwds)
>
> /usr/local/lib/python2.7/dist-packages/sklearn/metrics/pairwise.pyc in euclidean_distances(X, Y, Y_norm_squared, squared)
>     172     # TODO: a faster Cython implementation would do the clipping of negative
>     173     # values in a single pass over the output matrix.
> --> 174     distances = safe_sparse_dot(X, Y.T, dense_output=True)
>     175     distances *= -2
>     176     distances += XX
>
> /usr/local/lib/python2.7/dist-packages/sklearn/utils/extmath.pyc in safe_sparse_dot(a, b, dense_output)
>      76         return ret
>      77     else:
> ---> 78         return np.dot(a, b)
>      79
>      80
>
> MemoryError:
>
> As far as I understand, this is due to the pairwise distance computation.
> I guess I ran out of memory with DBSCAN for the same reason.
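> If my arithmetic is right, the dense float64 distance matrix alone would
> need about 262271 * 262271 * 8 bytes ≈ 512 GB, far more than my 8 GB of RAM.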
>
> My data is one-dimensional, with shape (262271, 1). I am using scikit-learn
> version 0.14.1.
> My system configuration is as follows:
>
> RAM: 8 GB
> Processor: i7
> OS: Ubuntu 64 bit
> Questions
>
>    1. Is there a better metric, cluster-validity index, or scoring method
>    for finding the optimum number of states in this case? If so, will it
>    also run into memory problems? Or is there a workaround with the
>    Silhouette coefficient?
>    2. If one were to use DBSCAN on such datasets, is there some way to
>    avoid the memory issues?
>
>