2013/5/25 Mike Hansen <[email protected]>:
> As suggested in the subject line, I am attempting to perform a nearest
> neighbors analysis (unsupervised, I hope) on documents I collected from
> around the web.  I have read through Sci-Kit's kNN documentation, I am
> working from the Nearest Centroid Example, and based on this documentation's
> notes section I am confident that Sci-Kit's kNN can classify text.

You're confusing k-NN and nearest centroids. The latter is always
supervised. What you need is the "raw" nearest neighbors algorithm
[1], which gives you the k nearest neighbors of a point (kneighbors
method) or the neighbors within some radius (radius_neighbors).
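A minimal sketch of that raw API, with made-up 2-D points for illustration (swap in your own feature matrix):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# toy training data: four points in 2-D feature space
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [5.0, 5.0]])

nn = NearestNeighbors(n_neighbors=2).fit(X)

# k nearest training points for a query point
dist, idx = nn.kneighbors([[0.9, 0.9]])

# all training points within a given radius of the query
dist_r, idx_r = nn.radius_neighbors([[0.9, 0.9]], radius=1.0)
```

Note that neither call needs labels; it just returns indices into the training set plus distances.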

> Format of X. In the Nearest Centroid Example, X ("X = iris.data[:, :2]")
> returns a 2D array that looks like this: [...[ 6.7  3.3] [ 6.7  3. ] [ 6.3
> 2.5] [ 6.5  3. ][ 6.2  3.4]   [ 5.9  3. ]].  Is it possible for me to
> transform my text-based values (likely either the similarity or distance
> measurement) into a comparable 2D array?  If so, how so?  If not, what would
> you recommend?  Can I still complete a nearest neighbors analysis on my
> documents?

The format of X is a Numpy array or Scipy sparse matrix of shape
(n_samples, n_features). TfidfVectorizer returns such a matrix for
text documents, using the bag-of-words assumption (one feature per
unique word in the training set).
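For instance (the documents here are made up), you can vectorize a corpus and inspect the resulting matrix like this:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat",
        "the dog chased the cat",
        "stock prices fell sharply"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)  # SciPy sparse matrix, shape (n_samples, n_features)

print(X.shape)  # (3, number of unique words in the corpus)
```

This X can be passed straight to NearestNeighbors.fit and .kneighbors.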

> Value of y.  Just below X in the Nearest Centroid Example is y ("y =
> iris.target").  It returns what appears to be a 1D array of nearest neighbor
> indices.  Is that the case?  If not, what is it?  And most importantly,
> irrespective of its value, how can I get a similar value out of my text
> documents?

In an unsupervised problem, there's no y. It holds the target labels,
aka the training signal, for supervised learning.

**However**, be warned that the space (!) complexity of getting the k
nearest training samples for a collection of n_test documents, where
the training set had n_train documents, is currently O(n_test ×
n_train), so you need to feed it small batches of test samples when
the training set is large. If you want your algorithm to scale,
consider using something like Lucene instead: index the training set,
then use the test documents as queries against it and pick the k best
matches. Lucene is also a better storage solution than scikit-learn (I
don't think anyone will take offense at this remark :)
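A hypothetical batching sketch (the batch size and random data are made up): querying in chunks keeps the dense (n_batch × n_train) distance block small.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
X_train = rng.rand(1000, 20)  # stand-in for the training matrix
X_test = rng.rand(200, 20)    # stand-in for the test matrix

nn = NearestNeighbors(n_neighbors=5).fit(X_train)

batch_size = 50  # tune so batch_size * n_train fits comfortably in memory
all_idx = []
for start in range(0, X_test.shape[0], batch_size):
    _, idx = nn.kneighbors(X_test[start:start + batch_size])
    all_idx.append(idx)

neighbors = np.vstack(all_idx)  # shape (n_test, 5)
```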

[1] 
http://scikit-learn.org/dev/modules/generated/sklearn.neighbors.NearestNeighbors.html

-- 
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general