2012/9/8 Aliabbas Petiwala <[email protected]>:
> Hi,
> I am trying to cluster a list of text documents based on similarity by first
> identifying the clusters using PCA and then running k-means on the PCA
> results, as shown below. The problem is that k-means does output the 3
> clusters, but the plot function fails to display the clustering results. The
> plot shows only one dot, on which all the cluster centers overlap. What I
> want is a simple cluster diagram that shows each document as a dot on the
> plot, with each dot labelled with the document name or number.
>
>     vectorizer = TfidfVectorizer(max_features=50, max_df=0.5,
>                                  stop_words='english', charset_error='ignore')
>     proptable = vectorizer.fit_transform([eereader.raw(f) for f in eereader.fileids()])
>     X=proptable.todense()
>
>     pca = PCA(n_components=2).fit(X)
>     X_pca = pca.transform(X)
>     print pca.components_, '\nvar=', pca.explained_variance_, '\nratio=\n', pca.explained_variance_ratio_
>
>     kmeans = KMeans(3).fit(X_pca)
>
>     print 'clusters:',kmeans.cluster_centers_
>
>     plot_2D(X_pca, [1,2,3], ['group1','group2','group4'])
>
> def plot_2D(data, target, target_names):
>      colors = cycle('rgbcmykw')
>      target_ids = range(len(target_names))
>      pylab.figure()
>      for i, c, label in zip(target_ids, colors, target_names):
>          pylab.scatter(data[target == i, 0], data[target == i, 1],
>                     c=c, label=label)
>      pylab.legend()
>      pylab.show()

You pass `target=[1, 2, 3]` to your plot function; pass
`target=kmeans.labels_` instead, since target must have the same shape[0]
as data (one cluster label per row of data).
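
To illustrate why the list does not work as a mask, here is a minimal sketch
using made-up data (the `data` and `labels` arrays below are hypothetical
stand-ins, not the poster's corpus). Comparing a plain Python list with an
integer yields a single bool, while comparing a NumPy array yields one bool
per row:

```python
import numpy as np

# Hypothetical stand-ins: 6 two-dimensional points and one label per point,
# shaped like X_pca and kmeans.labels_ would be.
data = np.arange(12, dtype=float).reshape(6, 2)
labels = np.array([0, 1, 2, 0, 1, 2])

# A plain list compared with an int is a single bool, not a per-row mask.
list_mask = ([1, 2, 3] == 1)

# A NumPy array compared with an int is an elementwise boolean mask.
array_mask = (labels == 1)

# Only the array mask can select "the rows that belong to cluster 1".
selected = data[array_mask]
```

This is why `target` needs to be an array with one entry per row of `data`.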

Furthermore, the ids compared against target start at 0, not 1:

target_ids = range(len(target_names))

is equivalent to:

target_ids = [0, 1, 2]
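
Putting both points together, a corrected plot function could look roughly
like the sketch below. It is only an illustration under a few assumptions:
the random `X_pca` stands in for the real PCA output, the `point_labels`
argument and the `ax.annotate` call are my additions to label each dot with
a document number, and the group names are placeholders.

```python
from itertools import cycle

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans


def plot_2D(data, target, target_names, point_labels=None):
    """Scatter 2-D points coloured by cluster id; optionally label each dot."""
    data = np.asarray(data)
    target = np.asarray(target)  # an array, so `target == i` is a per-row mask
    fig, ax = plt.subplots()
    colors = cycle('rgbcmyk')
    # Cluster ids run from 0 to len(target_names) - 1, like kmeans.labels_.
    for i, c, label in zip(range(len(target_names)), colors, target_names):
        mask = target == i
        ax.scatter(data[mask, 0], data[mask, 1], color=c, label=label)
    if point_labels is not None:
        # Hypothetical extra: write a name or number next to each dot.
        for (x, y), name in zip(data, point_labels):
            ax.annotate(str(name), (x, y))
    ax.legend()
    return fig, ax


# Made-up stand-in for X_pca: 30 random 2-D points.
rng = np.random.RandomState(0)
X_pca = rng.randn(30, 2)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_pca)

# Pass one label per row (kmeans.labels_), not [1, 2, 3].
fig, ax = plot_2D(X_pca, kmeans.labels_, ['group1', 'group2', 'group3'],
                  point_labels=range(len(X_pca)))
# plt.show()  # uncomment to display the figure interactively
```

With the real data you would pass your own `X_pca` and, for `point_labels`,
the list of document names or numbers.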

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
