2013/9/29 Tasos Ventouris <[email protected]>:
> I am trying to create a script to compute the similarity for only two
> documents. I wrote this code but if I use two docs on the data set, the
> results is a 2x2 matrix with [[1,0],[0,1]]. If I use more than 2 documents,
> the results are almost correct. Any suggestion?
Have you inspected the vocabulary of the vectorizer? Do you have any
reason to think the documents are similar in any way?

> def lsa(doc1,doc2):
>     dataset = [doc1,doc2]
>     vectorizer = TfidfVectorizer(stop_words='english')
>     X = vectorizer.fit_transform(dataset)
>     lsa = TruncatedSVD(n_components=100)
>     X = lsa.fit_transform(X)
>     X = Normalizer(copy=False).fit_transform(X)
>
>     return cosine_similarity(X)

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
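
One way to run the vocabulary check suggested in the reply is sketched below. It is only an illustration, not the poster's code: the helper name `tfidf_similarity` is made up, and it deliberately skips the TruncatedSVD step, since with two documents the tf-idf matrix has rank at most 2 and 100 components cannot add information. If the two documents share no terms once English stop words are removed, the off-diagonal similarity is 0, which would match the [[1,0],[0,1]] result.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def tfidf_similarity(doc1, doc2):
    """Return (similarity, shared_terms) for two documents.

    Hypothetical helper for debugging: compares the raw tf-idf vectors
    and lists the terms that occur in both documents.
    """
    vectorizer = TfidfVectorizer(stop_words='english')
    X = vectorizer.fit_transform([doc1, doc2])

    # Boolean 2 x n_terms array: True where a term occurs in a document.
    present = X.toarray() > 0

    # vocabulary_ maps each term to its column index in X.
    shared_terms = sorted(
        term for term, idx in vectorizer.vocabulary_.items()
        if present[0, idx] and present[1, idx]
    )

    # Off-diagonal entry of the 2x2 cosine similarity matrix.
    similarity = float(cosine_similarity(X)[0, 1])
    return similarity, shared_terms


if __name__ == "__main__":
    sim, shared = tfidf_similarity("the cat sat on the mat",
                                   "a cat slept on the mat")
    print("shared terms:", shared)
    print("similarity:", sim)
```

If `shared_terms` comes back empty for the real documents, the identity matrix is the expected answer rather than a bug in the pipeline.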
