Re: [Scikit-learn-general] problem clustering using PCA and kmeans

Aliabbas Petiwala Sun, 09 Sep 2012 04:20:29 -0700

Thanks Olivier, that helped me get the output. But with the same code as
given before I am not getting proper clusters: as shown in the plot below,
there are no clearly separate clusters and the points seem to overlap.
Using hierarchical clustering on the same dataset, however, I did find
about 7 distinct clusters.


Output for kmeans: https://www.sugarsync.com/pf/D6041365_9522679_000759
Output for dendrogram: https://www.sugarsync.com/pf/D6041365_9522679_002521
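
One possible explanation for the overlap, offered as an assumption rather than a diagnosis of this dataset: two principal components may retain only a small share of the variance of a tf-idf matrix, so groups that are separate in the full feature space can collapse onto each other in 2-D. A minimal sketch of the alternative is to run k-means on the full matrix and use PCA only for plot coordinates. The function name `cluster_then_project` and its parameters are hypothetical, introduced here for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA


def cluster_then_project(X, n_clusters, random_state=0):
    """Cluster in the full feature space; use PCA only to get 2-D plot coordinates.

    Returns (labels, coords_2d, retained), where `retained` is the fraction of
    total variance kept by the two principal components.
    """
    X = np.asarray(X)  # also converts the np.matrix returned by .todense()
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    labels = km.fit_predict(X)           # one cluster label per row of X
    pca = PCA(n_components=2).fit(X)
    retained = pca.explained_variance_ratio_.sum()
    return labels, pca.transform(X), retained
```

If `retained` comes out small, overlap in the 2-D plot says little about how well separated the clusters are in the original space. Calling it with `n_clusters=7` would allow a comparison against the dendrogram.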

On Sun, Sep 9, 2012 at 2:48 AM, Olivier Grisel <[email protected]> wrote:

> 2012/9/8 Aliabbas Petiwala <[email protected]>:
> > Hi,
> > I am trying to cluster a list of text docs based on similarity by first
> > identifying the clusters using PCA and then proceeding with k-means using
> > the results of PCA, as shown below. The problem is that k-means does
> > output the 3 clusters, but the plot function fails to display the
> > clustering results. The plot only shows one dot, on which all cluster
> > centers overlap. What I want is a simple cluster diagram visualizing the
> > clusters, representing each doc as a dot on the plot, with each dot
> > labelled with the doc name or number.
> >
> > vectorizer = TfidfVectorizer(max_features=50, max_df=0.5,
> >                              stop_words='english', charset_error='ignore')
> > proptable = vectorizer.fit_transform([eereader.raw(f) for f in
> >                                       eereader.fileids()])
> > X = proptable.todense()
> >
> > pca = PCA(n_components=2).fit(X)
> > X_pca = pca.transform(X)
> > print pca.components_, '\nvar=', pca.explained_variance_, \
> >     '\nratio=\n', pca.explained_variance_ratio_
> >
> > kmeans = KMeans(3).fit(X_pca)
> >
> > print 'clusters:', kmeans.cluster_centers_
> >
> > plot_2D(X_pca, [1,2,3], ['group1','group2','group4'])
> >
> > def plot_2D(data, target, target_names):
> >      colors = cycle('rgbcmykw')
> >      target_ids = range(len(target_names))
> >      pylab.figure()
> >      for i, c, label in zip(target_ids, colors, target_names):
> >          pylab.scatter(data[target == i, 0], data[target == i, 1],
> >                     c=c, label=label)
> >      pylab.legend()
> >      pylab.show()
>
> You pass `target=[1, 2, 3]` instead of `target=kmeans.labels_` to your
> plot function; `target` should have the same `shape[0]` as `data`.
>
> Furthermore:
>
> target_ids = range(len(target_names))
>
> is equivalent to:
>
> target_ids = [0, 1, 2]
>
> --
> Olivier
> http://twitter.com/ogrisel - http://github.com/ogrisel
>
>
> ------------------------------------------------------------------------------
> Live Security Virtual Conference
> Exclusive live event will cover all the ways today's security and
> threat landscape has changed and how IT managers can respond. Discussions
> will include endpoint security, mobile security and the latest in malware
> threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
> _______________________________________________
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
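
The suggestion in the quoted reply can be sketched as follows. This is an illustrative rewrite, not the original poster's code: the name `plot_2d`, the `names` parameter, and the choice of the non-interactive `Agg` backend are assumptions made for the example. The labels are converted to a NumPy array so that `labels == k` yields a boolean mask with one entry per row of `data`; with a plain list such as `[1, 2, 3]` the comparison gives a single `False` and no per-point selection happens.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to open a window
import matplotlib.pyplot as plt


def plot_2d(data, labels, names=None):
    """Scatter 2-D points coloured by cluster label and annotate each point."""
    data = np.asarray(data)
    labels = np.asarray(labels)  # array, so `labels == k` is an element-wise mask
    if labels.shape[0] != data.shape[0]:
        raise ValueError("labels must have one entry per row of data")
    fig, ax = plt.subplots()
    for k in np.unique(labels):
        mask = labels == k  # True for the points assigned to cluster k
        ax.scatter(data[mask, 0], data[mask, 1], label="cluster %d" % k)
    # Label every dot with its name, or its row index if no names are given.
    for i, (x, y) in enumerate(data[:, :2]):
        ax.annotate(str(i) if names is None else names[i], (x, y))
    ax.legend()
    return fig
```

With the variables from the quoted code this would be called as `plot_2d(X_pca, kmeans.labels_, eereader.fileids())`, followed by `fig.savefig(...)` or, with an interactive backend, `plt.show()`.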



-- 
Aliabbas Petiwala | PhD Scholar | Interdisciplinary Program in Education
Technology | IIT Bombay | +919664867707 | http://home.iitb.ac.in/~aliabbas/

