The problem is the misuse of the label encoder. See https://github.com/scikit-learn/scikit-learn/issues/8767
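To make that concrete, here is a minimal sketch. The toy names and texts are my own assumptions standing in for wiki['name'] and wiki['text']; they are not taken from the Gist. LabelEncoder numbers the names in sorted order, so its codes generally do not match row positions in count_vect. kneighbors returns row positions of the fitted matrix, so those should be mapped back through the original "name" column rather than through inverse_transform.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data standing in for wiki['name'] and wiki['text'].
names = np.array(["Zed", "Amy", "Bob", "Cat"])
texts = [
    "apple apple banana",
    "car engine wheel",
    "apple banana banana",
    "car wheel wheel",
]

# Bag-of-words counts, one row per text, in the same order as `names`.
counts = CountVectorizer().fit_transform(texts)

# NearestNeighbors is unsupervised: only the feature matrix is needed.
model = NearestNeighbors(n_neighbors=2, algorithm="brute", metric="euclidean")
model.fit(counts)

# Find the query by its row position in the data, not by a label code.
query_row = int(np.flatnonzero(names == "Zed")[0])

# kneighbors returns row indices into the fitted matrix.
neighbor_rows = model.kneighbors(counts[[query_row]], return_distance=False)[0]

# Map row indices back to names by plain positional indexing.
neighbor_names = names[neighbor_rows].tolist()

# For contrast: the code LabelEncoder assigns follows sorted name order.
label_code = int(LabelEncoder().fit(names).transform(["Zed"])[0])

print(query_row, label_code, neighbor_names)
```

With this toy data, "Zed" sits in row 0 but gets label code 3, so indexing count_vect with the encoder output would select a different person's text as the query.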
On 20 April 2017 at 19:58, Alex Garel <a...@garel.org> wrote:

> I'm not totally sure of what you're trying to do, but here are some
> remarks that may help you:
>
> 1. In modelfit = model.fit(count_vect, enc), the enc parameter is not
>    used; only the count_vect matrix is used.
> 2. When you use kneighbors you get vectors corresponding to wiki['text'],
>    not to wiki['name'], so it seems very strange to use
>    mod_enc.inverse_transform on it!
>
> Maybe you should instead find those vectors in count_vect and read
> "name" at the corresponding row in your dataframe.
>
> Hope it helps,
>
> Alex
>
> On 16/04/2017 at 10:56, Evaristo Caraballo via scikit-learn wrote:
>
> I have been asked to implement a simple knn for text similarity analysis.
> I tried using the sklearn.neighbors module.
> The file to be analysed consisted of 2 relevant columns: "text" and "name".
> The knn model should be fitted with a bag-of-words of a corpus of around
> 60,000 pre-treated text fragments of about 200 words each. I used
> CountVectorizer.
> As a test I was asked to use the model to get the names in the "name" column
> related to the top 10 text strings that are closest to a pre-selected one
> that also exists in the corpus used to initialise the knn model. Similarity
> distance should be measured using a Euclidean metric.
> I used the kneighbors function to obtain the closest neighbors.
>
> Below you can find the code I was trying to implement using kneighbors:
>
> import os, sys
> import sklearn
> import sklearn.neighbors as sk_neighbors
> from sklearn.feature_extraction.text import CountVectorizer
> import pandas
> import scipy
> import matplotlib.pyplot as plt
> import numpy as np
> %matplotlib inline
>
> wiki = pandas.read_csv('wiki_filefragment.csv')
>
> mod_count_vect = CountVectorizer()
> count_vect = mod_count_vect.fit_transform(wiki['text'])
> print(count_vect.shape)
> mod_count_vect.get_feature_names()
>
> mod_enc = sklearn.preprocessing.LabelEncoder().fit(wiki['name'])
> enc = mod_enc.transform(wiki['name'])
> enc
>
> # no matter what I use, it is always the same
> model = sk_neighbors.NearestNeighbors(n_neighbors=10, algorithm='brute', p=2)
> modelfit = model.fit(count_vect, enc)
> # also likely the kneighbors is not working?
> print(mod_enc.inverse_transform(
>     modelfit.kneighbors(
>         count_vect[mod_enc.transform(['Franz Rottensteiner'])],
>         n_neighbors=11, return_distance=False)))
>
> This implementation gave me the following results for the first 10 nearest
> neighbors to 'Franz Rottensteiner':
>
> Franz Rottensteiner, Ren%C3%A9 Froger, Ichikawa Ennosuke III, Tofusquirrel,
> M. G. Sheftall, Peter Maurer, Allan Weisbecker, Ferdinand Knobloch,
> Andrea Foulkes, Alan W. Meerow, John Warner (writer)
>
> These results are far from the test solution (which uses Graphlab Create
> and SFrame), which is:
>
> Franz Rottensteiner, Ian Mitchell (author), Rajiva Wijesinha, Andr%C3%A9
> Hurst, Leslie R. Landrum, Andrew Pinsent, Alan W. Meerow, John Angus
> Campbell, Antonello Bonci, Henkjan Honing, Joseph Born Kadane
>
> In fact, I tried a simple brute-force implementation, iterating over the
> list of texts and calculating distances with scipy, and that gave me the
> expected results. The result was the same when using Python 2.7.
>
> A link to the implementations (the one that doesn't work and the one that
> does), together with a sample of the file used for this test, can be found
> in this Gist <https://gist.github.com/evaristoc/eb2f2d91524b874c4db6638359e32b0f>.
>
> Can anyone suggest what is wrong with my sklearn implementation?
>
> Relevant resources are:
> - Anaconda Python 3.5 (with a virtualenv using 2.7)
> - Jupyter
> - sklearn 0.18
> - pandas
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn