Re: [Scikit-learn-general] Clustering of Text Documents

Lars Buitinck Sun, 02 Jun 2013 04:15:58 -0700

2013/6/1 Harold Nguyen <har...@nexgate.com>:
> I was wondering if anyone can point me to a tutorial on clustering text
> documents, but then also displaying the results in a graph ? I see some
> examples on clustering text documents, but I'd like to be able to visualize
> the clusters.


You'll need dimensionality reduction to be able to plot the results.
The following short example script uses RandomizedPCA to compute a
truncated singular value decomposition of the tf-idf vectors and draws
a scatter plot of the two resulting components. Colors indicate
cluster membership.


import sys

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import RandomizedPCA
from sklearn.feature_extraction.text import TfidfVectorizer

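# Each command-line argument is the path of one document.
v = TfidfVectorizer(input='filename', min_df=2)
X = v.fit_transform(sys.argv[1:])
# Cluster in the full tf-idf space.
km = KMeans(n_clusters=4).fit(X)
# Reduce to two dimensions, for plotting only.
svd = RandomizedPCA(n_components=2)
X2 = svd.fit_transform(X)

# One point per document, colored by its k-means cluster label.
plt.scatter(X2[:, 0], X2[:, 1], c=km.labels_)
plt.show()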
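

Later scikit-learn releases no longer ship RandomizedPCA. A rough
equivalent of the script above, assuming TruncatedSVD as the
replacement for sparse tf-idf input, might look like the sketch below.
The function names, the in-memory corpus and the choice of two clusters
are made up for illustration and are not part of the original script.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer


def cluster_and_project(docs, n_clusters=2, seed=0):
    """Cluster tf-idf vectors, then project them to 2-d for plotting."""
    X = TfidfVectorizer().fit_transform(docs)  # sparse tf-idf matrix
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X)
    # TruncatedSVD accepts sparse input and does not center it.
    X2 = TruncatedSVD(n_components=2, random_state=seed).fit_transform(X)
    return X2, km.labels_


def plot_clusters(X2, labels):
    """Scatter plot of the 2-d projection, colored by cluster label."""
    import matplotlib.pyplot as plt  # imported lazily; only needed to plot
    plt.scatter(X2[:, 0], X2[:, 1], c=labels)
    plt.show()


# Tiny made-up corpus, only for illustration.
docs = [
    "the cat sat on the mat",
    "a cat and a kitten play on the mat",
    "my cat chased the kitten",
    "stock prices fell on the market today",
    "the market rallied as stock prices rose",
    "investors watch the stock market",
]
X2, labels = cluster_and_project(docs)
```

Calling plot_clusters(X2, labels) then draws the same kind of plot as
the original script.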


Note that after dimensionality reduction, the clusters may no longer
be easy to spot, because k-means ran on the full tf-idf vectors rather
than on the 2-d projection. Alternatively, you can cluster X2 directly
for a prettier plot, but then make sure you measure the quality of the
clustering (see examples/document_clustering.py in the scikit-learn
sources) both with and without the PCA.
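
One way to make that comparison, sketched under some assumptions:
the corpus is a made-up in-memory list, TruncatedSVD stands in for the
PCA step, and because no ground-truth labels are available here, both
labelings are scored with the silhouette coefficient in the original
tf-idf space. With labeled data, the metrics used in
document_clustering.py would be the better guide.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Tiny made-up corpus, only for illustration.
docs = [
    "the cat sat on the mat",
    "a cat and a kitten play on the mat",
    "my cat chased the kitten",
    "stock prices fell on the market today",
    "the market rallied as stock prices rose",
    "investors watch the stock market",
]

X = TfidfVectorizer().fit_transform(docs)
X2 = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)


def kmeans_labels(data, n_clusters=2, seed=0):
    """Return the k-means cluster label of each row of `data`."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return km.fit(data).labels_


labels_full = kmeans_labels(X)   # clustered without reduction
labels_2d = kmeans_labels(X2)    # clustered after reduction

# Score both labelings in the same (full tf-idf) space so the
# numbers are comparable.
score_full = silhouette_score(X, labels_full)
score_2d = silhouette_score(X, labels_2d)
# Agreement between the two labelings (1.0 means identical partitions).
agreement = adjusted_rand_score(labels_full, labels_2d)

print("silhouette, clustered on tf-idf: %.3f" % score_full)
print("silhouette, clustered on 2-d:    %.3f" % score_2d)
print("adjusted Rand index between the two: %.3f" % agreement)
```

If the 2-d clustering scores much worse in the original space, the
prettier plot is hiding a poorer clustering.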

-- 
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam

_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
