2013/6/1 Harold Nguyen <har...@nexgate.com>:
> I was wondering if anyone can point me to a tutorial on clustering text
> documents, but then also displaying the results in a graph? I see some
> examples on clustering text documents, but I'd like to be able to visualize
> the clusters.
You'll need dimensionality reduction to be able to plot the results.
The following is a short example script that uses RandomizedPCA to
perform a singular value decomposition on tf-idf vectors and draws a
scatterplot. Colors indicate cluster membership.

import sys

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import RandomizedPCA
from sklearn.feature_extraction.text import TfidfVectorizer

# The command-line arguments are the paths of the text files to cluster.
v = TfidfVectorizer(input='filename', min_df=2)
X = v.fit_transform(sys.argv[1:])

# Cluster in the full tf-idf space.
km = KMeans(n_clusters=4).fit(X)

# Reduce to two dimensions, for plotting only.
svd = RandomizedPCA(n_components=2)
X2 = svd.fit_transform(X)

plt.scatter(X2[:, 0], X2[:, 1], c=km.labels_)
plt.show()

Note that after dimensionality reduction, the clusters are no longer
easy to spot. Alternatively, you can cluster X2 directly for a prettier
plot, but then make sure you measure the quality of the clustering
(see examples/document_clustering.py in the scikit-learn sources) both
with and without the PCA.

--
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam
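
For readers on a recent scikit-learn, where RandomizedPCA is no longer available, the script above could be adapted roughly as follows. This is only a sketch under stated assumptions: TruncatedSVD is swapped in as the sparse-friendly reduction, the silhouette score stands in for "measure the quality of the clustering", and the names cluster_and_project and plot_clusters are hypothetical helpers, not part of the original message or of scikit-learn.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score


def cluster_and_project(docs, n_clusters=2, random_state=0):
    """Cluster tf-idf vectors with and without a 2-D reduction.

    Hypothetical helper: returns the 2-D projection, both label
    arrays, and a silhouette score for each labeling.
    """
    X = TfidfVectorizer().fit_transform(docs)  # sparse tf-idf matrix

    # Assumption: TruncatedSVD replaces RandomizedPCA; it works on
    # sparse input directly and does not center the data.
    svd = TruncatedSVD(n_components=2, random_state=random_state)
    X2 = svd.fit_transform(X)

    # Cluster once in the full tf-idf space and once in the projection.
    km_full = KMeans(n_clusters=n_clusters, n_init=10,
                     random_state=random_state).fit(X)
    km_2d = KMeans(n_clusters=n_clusters, n_init=10,
                   random_state=random_state).fit(X2)

    # Score both labelings in the original space so they are comparable.
    scores = {
        "full": silhouette_score(X, km_full.labels_),
        "reduced": silhouette_score(X, km_2d.labels_),
    }
    return X2, km_full.labels_, km_2d.labels_, scores


def plot_clusters(X2, labels):
    """Scatterplot of the 2-D projection, colored by cluster label."""
    import matplotlib.pyplot as plt  # imported lazily; only needed to plot

    plt.scatter(X2[:, 0], X2[:, 1], c=labels)
    plt.show()
```

Calling cluster_and_project on a list of document strings and passing X2 with either label array to plot_clusters reproduces the two plots discussed above; comparing scores["full"] with scores["reduced"] gives a rough idea of how much the prettier plot costs in clustering quality.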