2013/5/25 Mike Hansen <[email protected]>:
> As suggested in the subject line, I am attempting to perform a nearest
> neighbors analysis (unsupervised, I hope) on documents I collected from
> around the web. I have read through Sci-Kit's kNN documentation, I am
> working from the Nearest Centroid Example, and based on this documentation's
> notes section I am confident that Sci-Kit's kNN can classify text.
You're confusing k-NN and nearest centroids. The latter is always
supervised. What you need is the "raw" nearest neighbors algorithm
[1], which gives you the k nearest neighbors of a point (kneighbors
method) or the neighbors within some radius (radius_neighbors).
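To make that concrete, here's a minimal sketch of the unsupervised NearestNeighbors estimator on toy numeric data (the data and parameter values are just illustrative):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# four training points in 2-D
X = np.array([[0.0, 0.0],
              [0.0, 1.0],
              [1.0, 0.0],
              [5.0, 5.0]])

nn = NearestNeighbors(n_neighbors=2)
nn.fit(X)

# k nearest training points of a query point
query = np.array([[0.1, 0.1]])
dist, idx = nn.kneighbors(query)
# idx[0] holds row indices into X; dist[0] the matching distances

# alternatively, all training points within some radius
dist_r, idx_r = nn.radius_neighbors(query, radius=1.5)
```

Note there is no y anywhere: fit() only stores/indexes the training points.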
> Format of X. In the Nearest Centroid Example, X ("X = iris.data[:, :2]")
> returns a 2D array that looks like this: [...[ 6.7 3.3] [ 6.7 3. ] [ 6.3
> 2.5] [ 6.5 3. ][ 6.2 3.4] [ 5.9 3. ]]. Is it possible for me to
> transform my text-based values (likely either the similarity or distance
> measurement) into a comparable 2D array? If so, how so? If not, what would
> you recommend? Can I still complete a nearest neighbors analysis on my
> documents?
The format of X is a Numpy array or Scipy sparse matrix of shape
(n_samples, n_features). TfidfVectorizer returns such a matrix for
text documents, using the bag-of-words assumption (one feature per
unique word in the training set).
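Putting the two together, something like the following should work (document strings and parameter choices such as the cosine metric are my own illustration, not the only option):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "completely unrelated text about stars"]

# bag-of-words tf-idf features: sparse matrix of shape (n_samples, n_features)
vect = TfidfVectorizer()
X = vect.fit_transform(docs)

# brute force is required for sparse input with the cosine metric
nn = NearestNeighbors(n_neighbors=2, metric="cosine", algorithm="brute")
nn.fit(X)

# new documents must go through the *same* fitted vectorizer
dist, idx = nn.kneighbors(vect.transform(["a cat on a mat"]))
```

idx then indexes back into docs, so idx[0][0] is the position of the most similar training document.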
> Value of y. Just below X in the Nearest Centroid Example is y ("y =
> iris.target"). It returns what appears to be a 1D array of nearest neighbor
> indices. Is that the case? If not, what is it? And most importantly,
> irrespective of its value, how can I get a similar value out of my text
> documents?
In an unsupervised problem, there's no y. It holds the target labels,
aka the training signal, and is only needed for supervised learning.
**However**, be warned that the space (!) complexity of getting the k
nearest training samples for a collection of n_test documents, when
the training set has n_train documents, is currently O(n_test ×
n_train), so you need to feed it small batches of test samples when
the training set is large. If you want your algorithm to scale,
consider using something like Lucene instead: index the training set,
then run the test documents as queries against that index and pick
the k best matches. Lucene is also a better storage solution than
scikit-learn (I don't think anyone will take offense at this remark :)
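Until then, batching the queries yourself keeps the intermediate distance matrix bounded. A rough sketch (the helper name and batch size are my own, not part of scikit-learn):

```python
import numpy as np

def kneighbors_batched(nn, X_test, batch_size=1000):
    """Query a fitted NearestNeighbors model in slices of at most
    batch_size rows, so memory use is O(batch_size * n_train)
    instead of O(n_test * n_train)."""
    all_dist, all_idx = [], []
    for start in range(0, X_test.shape[0], batch_size):
        dist, idx = nn.kneighbors(X_test[start:start + batch_size])
        all_dist.append(dist)
        all_idx.append(idx)
    return np.vstack(all_dist), np.vstack(all_idx)
```

The results are identical to a single kneighbors() call over the whole test set; only the peak memory differs.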
[1]
http://scikit-learn.org/dev/modules/generated/sklearn.neighbors.NearestNeighbors.html
--
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general