Hi Lars,

Thank you very much for this response. Please excuse my questions since I'm
new.

From the documentation on TfidfVectorizer here:

http://scikit-learn.org/dev/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

Does TfidfVectorizer take a sequence of filenames, where each file is just
a plain text file?
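
For reference, a minimal sketch of the filename-based usage, assuming a scikit-learn version where TfidfVectorizer accepts input='filename' (as in the quoted script below). The file names and sample texts are made up for illustration; each file is an ordinary plain text file and the vectorizer opens and reads it itself.

```python
import os
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sample documents, written to temporary plain text files.
texts = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stocks fell sharply today",
]

tmpdir = tempfile.mkdtemp()
paths = []
for i, text in enumerate(texts):
    path = os.path.join(tmpdir, "doc%d.txt" % i)
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    paths.append(path)

# With input='filename', fit_transform expects an iterable of file paths
# rather than an iterable of strings.
vectorizer = TfidfVectorizer(input="filename")
X = vectorizer.fit_transform(paths)

# One row per file, one column per term in the learned vocabulary.
print(X.shape)
```

Passing sys.argv[1:] instead of paths, as in the quoted script, gives the same behaviour for files named on the command line.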

Also, according to the link, I thought min_df only lives in the interval
[0.0, 1.0], or did I misunderstand? If it's an "int", does that mean
the number of occurrences rather than the frequency?
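
A small sketch of the two forms of min_df, as far as the documentation describes them: a float is read as a proportion of documents, while an int is read as an absolute number of documents containing the term (a document count, not the total number of occurrences). The toy documents below are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: "apple" is in 4 documents, "banana" in 2,
# every other term in only 1.
docs = [
    "apple banana",
    "apple cherry",
    "apple banana date",
    "apple elderberry",
]

# int: keep terms that appear in at least 2 documents.
v_count = TfidfVectorizer(min_df=2).fit(docs)

# float: keep terms that appear in at least 50% of the documents
# (2 of the 4 documents here).
v_prop = TfidfVectorizer(min_df=0.5).fit(docs)

print(sorted(v_count.vocabulary_))
print(sorted(v_prop.vocabulary_))
```

On this corpus both settings keep the same terms, because 50% of four documents is two documents. Note that min_df=1 (int) and min_df=1.0 (float) are not the same: the first keeps any term seen in at least one document, the second only terms present in every document.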

Thank you,

Harold


On Sun, Jun 2, 2013 at 4:14 AM, Lars Buitinck <l.j.buiti...@uva.nl> wrote:

> 2013/6/1 Harold Nguyen <har...@nexgate.com>:
> > I was wondering if anyone can point me to a tutorial on clustering text
> > documents, but then also displaying the results in a graph ? I see some
> > examples on clustering text documents, but I'd like to be able to visualize
> > the clusters.
>
> You'll need dimensionality reduction to be able to plot the results.
> The following is a short example script that uses RandomizedPCA to
> perform a singular value decomposition on tf-idf vectors and plots a
> scatterplot. Colors indicate cluster membership.
>
>
> import sys
>
> import matplotlib.pyplot as plt
> from sklearn.cluster import KMeans
> from sklearn.decomposition import RandomizedPCA
> from sklearn.feature_extraction.text import TfidfVectorizer
>
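> # Read each file named on the command line; ignore terms that occur
> # in fewer than 2 documents.
> v = TfidfVectorizer(input='filename', min_df=2)
> X = v.fit_transform(sys.argv[1:])
> # Cluster the tf-idf vectors into 4 groups.
> km = KMeans(n_clusters=4).fit(X)
> # Reduce to 2 dimensions so the documents can be plotted.
> svd = RandomizedPCA(n_components=2)
> X2 = svd.fit_transform(X)
>
> # Color each point by its cluster label.
> plt.scatter(X2[:, 0], X2[:, 1], c=km.labels_)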
> plt.show()
>
>
> Note that after dimensionality reduction, the clusters are no longer
> easy to spot. Alternatively, you can cluster X2 directly for a
> prettier plot, but then make sure you measure the quality of the
> clustering (see examples/document_clustering.py in the scikit-learn
> sources) both with and without the PCA.
>
> --
> Lars Buitinck
> Scientific programmer, ILPS
> University of Amsterdam
>
>
> ------------------------------------------------------------------------------
> Get 100% visibility into Java/.NET code with AppDynamics Lite
> It's a free troubleshooting tool designed for production
> Get down to code-level detail for bottlenecks, with <2% overhead.
> Download for free and get started troubleshooting in minutes.
> http://p.sf.net/sfu/appdyn_d2d_ap2
> _______________________________________________
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>