I'm not totally sure of what you're trying to do, but here are some remarks that may help you:
1. In modelfit = model.fit(count_vect, enc), the enc parameter is not used; only the count_vect matrix is used.

2. When you use kneighbors you get back row indices of the vectors you fitted, i.e. rows of wiki['text'], not codes of wiki['name'], so it seems very strange to use mod_enc.inverse_transform on them! It would be better to take those row indices into count_vect and read "name" at the corresponding rows of your dataframe (there is a small sketch below, after your quoted message).

Hope it helps,
Alex

On 16/04/2017 at 10:56, Evaristo Caraballo via scikit-learn wrote:
> I have been asked to implement a simple knn for text similarity
> analysis. I tried using the sklearn.neighbors module.
> The file to be analysed consists of 2 relevant columns: "text" and
> "name".
> The knn model should be fitted with bag-of-words of a corpus of around
> 60,000 pre-treated text fragments of about 200 words each. I used
> CountVectorizer.
> As a test I was asked to use the model to get the names in the "name"
> column related to the 10 text strings that are closest to a
> pre-selected one that also exists in the corpus used to initialise the
> knn model. Similarity distance should be measured using a euclidean
> metric.
> I used the kneighbors function to obtain the closest neighbors.
> Below you can find the code I was trying to implement using kneighbors:
>
>     import os, sys
>     import sklearn
>     import sklearn.neighbors as sk_neighbors
>     from sklearn.feature_extraction.text import CountVectorizer
>     import pandas
>     import scipy
>     import matplotlib.pyplot as plt
>     import numpy as np
>     %matplotlib inline
>
>     wiki = pandas.read_csv('wiki_filefragment.csv')
>     mod_count_vect = CountVectorizer()
>     count_vect = mod_count_vect.fit_transform(wiki['text'])
>     print(count_vect.shape)
>     mod_count_vect.get_feature_names()
>     mod_enc = sklearn.preprocessing.LabelEncoder().fit(wiki['name'])
>     enc = mod_enc.transform(wiki['name'])
>     enc
>     model = sk_neighbors.NearestNeighbors(n_neighbors=10, algorithm='brute', p=2)  # no matter what I use, it is always the same
>     modelfit = model.fit(count_vect, enc)
>     # also likely the kneighbors is not working?
>     print(mod_enc.inverse_transform(modelfit.kneighbors(
>         count_vect[mod_enc.transform(['Franz Rottensteiner'])],
>         n_neighbors=11, return_distance=False)))
>
> This implementation gave me the following results for the first 10
> nearest neighbors to 'Franz Rottensteiner':
>
>     Franz Rottensteiner, Ren%C3%A9 Froger, Ichikawa Ennosuke III,
>     Tofusquirrel, M. G. Sheftall, Peter Maurer, Allan Weisbecker,
>     Ferdinand Knobloch, Andrea Foulkes, Alan W. Meerow, John Warner
>     (writer)
>
> The results remained far from the test solution (which uses Graphlab
> Create and SFrame), which is:
>
>     Franz Rottensteiner, Ian Mitchell (author), Rajiva Wijesinha,
>     Andr%C3%A9 Hurst, Leslie R. Landrum, Andrew Pinsent, Alan W.
>     Meerow, John Angus Campbell, Antonello Bonci, Henkjan Honing,
>     Joseph Born Kadane
>
> In fact, I tried a simple brute-force implementation by iterating over
> the list of texts and calculating distances with scipy, and that gave
> me the expected results. The result was the same after also trying
> Python 2.7.
> A link to the implementations (the one that doesn't work and the one
> that does), together with the file used for this test, can be found in
> this Gist
> <https://gist.github.com/evaristoc/eb2f2d91524b874c4db6638359e32b0f>.
> Can anyone suggest what is wrong with my sklearn implementation?
> Relevant resources are:
> - Anaconda Python 3.5 (with a virtenv using 2.7)
> - Jupyter
> - sklearn 0.18
> - pandas
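To make remark 2 concrete, here is a rough, untested sketch of what I had in mind (it reuses the file name, column names and query from your script; the variable names query_row and neighbor_rows are just mine). The point is only that kneighbors returns row indices into count_vect, so you can read wiki['name'] at those rows instead of going through the LabelEncoder:

    import pandas
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.neighbors import NearestNeighbors

    wiki = pandas.read_csv('wiki_filefragment.csv')  # "text" and "name" columns

    # bag-of-words matrix; row i of count_vect corresponds to row i of wiki
    count_vect = CountVectorizer().fit_transform(wiki['text'])

    # y is ignored by NearestNeighbors.fit, so there is no need to pass enc (remark 1)
    model = NearestNeighbors(n_neighbors=10, algorithm='brute', p=2).fit(count_vect)

    # locate the query document by its row in the dataframe, not through
    # LabelEncoder codes (assumes the default RangeIndex from read_csv)
    query_row = wiki.index[wiki['name'] == 'Franz Rottensteiner'][0]

    # kneighbors returns row indices into count_vect, i.e. rows of wiki (remark 2)
    neighbor_rows = model.kneighbors(count_vect[query_row],
                                     n_neighbors=11,
                                     return_distance=False)[0]

    # read the names at those rows instead of inverse_transform'ing the indices
    print(wiki['name'].iloc[neighbor_rows].tolist())

Since the query document is itself in the fitted matrix, it should come back as the first neighbour at distance 0, so asking for 11 neighbours to get 10 other names, as you already do, looks right to me.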
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn