[scikit-learn] sklearn - knn sklearn.neighbors kneighbors function producing unexpected result for text analysis?

Evaristo Caraballo via scikit-learn Sun, 16 Apr 2017 04:20:40 -0700

I have been asked to implement a simple knn for text similarity analysis. I
tried using the sklearn.neighbors module.

The file to be analysed consists of 2 relevant columns: "text" and "name".
The knn model should be fitted with a bag-of-words representation of a corpus
of around 60,000 pre-treated text fragments of about 200 words each. I used
CountVectorizer.

As a test, I was asked to use the model to get the names in the "name" column
for the 10 text strings that are closest to a pre-selected one that also
exists in the corpus used to initialise the knn model. Similarity should be
measured using a Euclidean metric. I used the kneighbors function to obtain
the closest neighbors.

Below you can find the code I was trying to implement using kneighbors:

import os, sys
import sklearn
import sklearn.neighbors as sk_neighbors
from sklearn.feature_extraction.text import CountVectorizer
import pandas
import scipy
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline


wiki = pandas.read_csv('wiki_filefragment.csv')

mod_count_vect = CountVectorizer()
count_vect = mod_count_vect.fit_transform(wiki['text'])
print(count_vect.shape)
mod_count_vect.get_feature_names()

mod_enc = sklearn.preprocessing.LabelEncoder().fit(wiki['name'])
enc = mod_enc.transform(wiki['name'])
enc

# no matter what I use, it is always the same
model = sk_neighbors.NearestNeighbors(n_neighbors=10, algorithm='brute', p=2)
modelfit = model.fit(count_vect, enc)

#also likely the kneighbors is not working?
print(mod_enc.inverse_transform(
    modelfit.kneighbors(
        count_vect[mod_enc.transform(['Franz Rottensteiner'])],
        n_neighbors=11,
        return_distance=False)))

This implementation gave me the following results for the 10 nearest
neighbors to 'Franz Rottensteiner' (listed first):

Franz Rottensteiner, Ren%C3%A9 Froger, Ichikawa Ennosuke III, Tofusquirrel,
M. G. Sheftall, Peter Maurer, Allan Weisbecker, Ferdinand Knobloch, Andrea
Foulkes, Alan W. Meerow, John Warner (writer)

These results are far from the test solution (which uses GraphLab Create and
SFrame), which is:

Franz Rottensteiner, Ian Mitchell (author), Rajiva Wijesinha, Andr%C3%A9 Hurst,
Leslie R. Landrum, Andrew Pinsent, Alan W. Meerow, John Angus Campbell,
Antonello Bonci, Henkjan Honing, Joseph Born Kadane

In fact, I tried a simple brute force implementation, iterating over the list
of texts and calculating distances with scipy, and that gave me the expected
results. The result was the same when using Python 2.7.

A link to the implementations (the one that doesn't work and the one that
does), together with a sample of the file used for this test, can be found on
this Gist.

Can anyone suggest what is wrong with my sklearn implementation?

Relevant resources are:
- Anaconda Python 3.5 (with a virtual environment using 2.7)
- Jupyter
- sklearn 0.18
- pandas
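
The brute force check described above can be sketched on a toy corpus as
follows. This is not the code from the Gist: the four-row DataFrame `toy`, the
helper `brute_force_neighbors`, and the names in it are made up for
illustration. The sketch looks names up by row position, since kneighbors
returns row positions in the fitted matrix, and it also prints the
LabelEncoder code of the query name so the two numbers can be compared.
Whether a mismatch between them explains the discrepancy on the real data is
an assumption that has not been verified here.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import euclidean
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy corpus standing in for wiki_filefragment.csv.
# The names are deliberately NOT in alphabetical order.
toy = pd.DataFrame({
    "name": ["Zoe", "Adam", "Mia", "Bob"],
    "text": [
        "science fiction editor and critic",
        "football player and coach",
        "science fiction author and critic",
        "football coach",
    ],
})

# Sparse bag-of-words matrix of shape (n_docs, n_terms).
counts = CountVectorizer().fit_transform(toy["text"])


def brute_force_neighbors(counts, query_row, k):
    """Return the row positions of the k rows closest to row `query_row`."""
    dense = counts.toarray()
    dists = [euclidean(dense[query_row], row) for row in dense]
    return np.argsort(dists, kind="stable")[:k]


# Row position of the query document in the fitted matrix.
query_row = int(np.flatnonzero(toy["name"].values == "Zoe")[0])

# LabelEncoder code of the same name: its rank among the sorted names.
label_code = int(LabelEncoder().fit(toy["name"]).transform(["Zoe"])[0])

nn = NearestNeighbors(n_neighbors=2, algorithm="brute", metric="euclidean")
nn.fit(counts)

# kneighbors returns row positions in the matrix passed to fit().
nn_rows = nn.kneighbors(counts[[query_row]], return_distance=False)[0]
bf_rows = brute_force_neighbors(counts, query_row, k=2)

# Map row positions back to names by position, not through LabelEncoder.
nn_names = toy["name"].iloc[nn_rows].tolist()
bf_names = toy["name"].iloc[bf_rows].tolist()

print("row position:", query_row, "label code:", label_code)
print("kneighbors:  ", nn_names)
print("brute force: ", bf_names)
```

On this toy data the two neighbor lists agree, while the row position of the
query and its label code differ, because LabelEncoder numbers the names in
sorted order.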

_______________________________________________
scikit-learn mailing list
[email protected]
https://mail.python.org/mailman/listinfo/scikit-learn
