Indeed, Joel: you are totally right. I absolutely misinterpreted the use of the encoder. Thanks Joel and Alex for having a look!
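For anyone finding this thread later, here is a minimal sketch of the fix as I understand it. It assumes a tiny made-up DataFrame in place of the real wiki file (the names and texts below are hypothetical). LabelEncoder assigns codes by sorted label order, so an encoded name is not a row position in count_vect. The neighbor indices returned by kneighbors are row positions, so they should be looked up directly in the "name" column.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy stand-in for the wiki DataFrame ("name" and "text" columns).
wiki = pd.DataFrame({
    "name": ["Zed", "Amy", "Bob", "Cat"],
    "text": [
        "science fiction critic and editor",
        "botanist who studies tropical plants",
        "science fiction author and critic",
        "football player and coach",
    ],
})

count_vect = CountVectorizer().fit_transform(wiki["text"])

# NearestNeighbors is unsupervised, so it is fitted on the count matrix only.
model = NearestNeighbors(n_neighbors=2, algorithm="brute", metric="euclidean")
model.fit(count_vect)

# The encoder gives the alphabetical rank of the label, not its row position:
# "Zed" sits in row 0 but is encoded as 3.
encoded = int(LabelEncoder().fit(wiki["name"]).transform(["Zed"])[0])

# Look up the query row by position in the DataFrame instead.
query_pos = int(np.flatnonzero((wiki["name"] == "Zed").to_numpy())[0])

# kneighbors returns row positions of the nearest rows in count_vect.
neighbor_pos = model.kneighbors(
    count_vect[[query_pos]], n_neighbors=2, return_distance=False
)[0]

# Read the names at those rows; no inverse_transform is needed.
neighbor_names = wiki["name"].iloc[neighbor_pos].tolist()
print(encoded, query_pos, neighbor_names)
```

With this toy data the query row comes back first (distance zero), followed by the row whose text differs from it by a single word.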
On Thursday, 20 April 2017 12:46:52, Joel Nothman <joel.noth...@gmail.com> wrote:

The problem is the misuse of the label encoder. See https://github.com/scikit-learn/scikit-learn/issues/8767

On 20 April 2017 at 19:58, Alex Garel <a...@garel.org> wrote:

I'm not totally sure what you're trying to do, but here are some remarks that may help you:

1. In modelfit = model.fit(count_vect, enc), the enc parameter is not used; only the count_vect matrix is used.
2. When you use kneighbors you get vectors corresponding to wiki['text'], not to wiki['name'], so it seems very strange to use mod_enc.inverse_transform on it! Maybe you should instead find those vectors in count_vect and read "name" at the corresponding row in your dataframe.

Hope it helps,
Alex

On 16/04/2017 at 10:56, Evaristo Caraballo via scikit-learn wrote:

I have been asked to implement a simple knn for text similarity analysis. I tried using the sklearn.neighbors module. The file to be analysed consisted of 2 relevant columns: "text" and "name". The knn model should be fitted with a bag-of-words of a corpus of around 60,000 pre-treated text fragments of about 200 words each. I used CountVectorizer.

As a test I was asked to use the model to get the names in the "name" column related to the top 10 text strings that are closest to a pre-selected one that also exists in the corpus used to initialise the knn model. Similarity distance should be measured using a Euclidean metric. I used the kneighbors function to obtain the closest neighbors.
Below you can find the code I was trying to implement using kneighbors:

    import os, sys
    import sklearn
    import sklearn.neighbors as sk_neighbors
    from sklearn.feature_extraction.text import CountVectorizer
    import pandas
    import scipy
    import matplotlib.pyplot as plt
    import numpy as np
    %matplotlib inline

    wiki = pandas.read_csv('wiki_filefragment.csv')

    mod_count_vect = CountVectorizer()
    count_vect = mod_count_vect.fit_transform(wiki['text'])
    print(count_vect.shape)
    mod_count_vect.get_feature_names()

    mod_enc = sklearn.preprocessing.LabelEncoder().fit(wiki['name'])
    enc = mod_enc.transform(wiki['name'])
    enc

    model = sk_neighbors.NearestNeighbors(n_neighbors=10, algorithm='brute', p=2)  # no matter what I use, it is always the same
    modelfit = model.fit(count_vect, enc)  # also likely the kneighbors is not working?

    print(mod_enc.inverse_transform(
        modelfit.kneighbors(
            count_vect[mod_enc.transform(['Franz Rottensteiner'])],
            n_neighbors=11,
            return_distance=False)))

This implementation gave me the following results for the first 10 nearest neighbors to 'Franz Rottensteiner':

    Franz Rottensteiner, Ren%C3%A9 Froger, Ichikawa Ennosuke III, Tofusquirrel, M. G. Sheftall, Peter Maurer, Allan Weisbecker, Ferdinand Knobloch, Andrea Foulkes, Alan W. Meerow, John Warner (writer)

These results remained far from the test solution (which uses GraphLab Create and SFrame), which is:

    Franz Rottensteiner, Ian Mitchell (author), Rajiva Wijesinha, Andr%C3%A9 Hurst, Leslie R. Landrum, Andrew Pinsent, Alan W. Meerow, John Angus Campbell, Antonello Bonci, Henkjan Honing, Joseph Born Kadane

In fact, I tried a simple brute-force implementation, iterating over the list of texts and calculating distances with scipy, and that gave me the expected results. The result was the same when I also used Python 2.7. A link to the implementations (the one that doesn't work and the one that does), together with a sample of the file used for this test, can be found on this Gist.
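A brute-force cross-check of the kind described above could look roughly like the following sketch. It assumes a small hypothetical list of texts rather than the real corpus, and uses scipy's cdist on dense arrays, which is only practical for small inputs. The point is that kneighbors with a Euclidean metric and a plain distance sort should agree on row positions.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import NearestNeighbors

# Hypothetical toy corpus; the real test used about 60,000 text fragments.
texts = [
    "science fiction critic and editor",
    "botanist who studies tropical plants",
    "science fiction author and critic",
    "football player and coach",
]
X = CountVectorizer().fit_transform(texts)
query_pos = 0  # row position of the query text

# Brute force: Euclidean distance from the query row to every row.
# Densifying with toarray() is fine for a toy corpus only.
dists = cdist(X[[query_pos]].toarray(), X.toarray(), metric="euclidean")[0]
brute_order = np.argsort(dists, kind="stable")[:3]

# The same query through NearestNeighbors, fitted on the count matrix alone.
knn = NearestNeighbors(n_neighbors=3, algorithm="brute", metric="euclidean")
knn.fit(X)
knn_order = knn.kneighbors(X[[query_pos]], return_distance=False)[0]

print(brute_order.tolist(), knn_order.tolist())
```

Both orderings are lists of row positions into the original data, so they can be compared directly or used to read values from another column at the same rows.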
Can anyone suggest what is wrong with my sklearn implementation?

Relevant resources are:
- Anaconda Python 3.5 (with a virtual environment using 2.7)
- Jupyter
- sklearn 0.18
- pandas
_______________________________________________ scikit-learn mailing list scikit-learn@python.org https://mail.python.org/mailman/listinfo/scikit-learn