2013/9/29 Tasos Ventouris <[email protected]>:
> I am trying to create a script to compute the similarity of only two
> documents. I wrote this code, but if I use two docs in the data set, the
> result is a 2x2 matrix, [[1,0],[0,1]]. If I use more than two documents,
> the results are almost correct. Any suggestions?

Have you inspected the vocabulary of the vectorizer? Do you have any
reason to think the documents are similar in any way?
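
One way to check this is sketched below. TF-IDF weights are non-negative,
so the cosine similarity of two documents is zero exactly when they share
no term after tokenization and stop-word removal. With only two documents
the matrix has rank at most 2, so as far as I can tell an SVD that keeps
every non-zero component leaves the dot products between the rows
unchanged and cannot add similarity that was not already there. The helper
name shared_terms and the sample sentences are made up for illustration;
they are not from your code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def shared_terms(doc1, doc2):
    """Hypothetical helper: list the terms both documents use and
    return them with the cosine similarity of the two TF-IDF rows."""
    vectorizer = TfidfVectorizer(stop_words='english')
    X = vectorizer.fit_transform([doc1, doc2]).toarray()
    # vocabulary_ maps each surviving term to its column index in X.
    shared = sorted(
        term for term, col in vectorizer.vocabulary_.items()
        if X[0, col] > 0 and X[1, col] > 0
    )
    return shared, cosine_similarity(X)[0, 1]


# Made-up documents with no vocabulary in common.
print(shared_terms("The cat sat on the mat",
                   "Stock markets fell sharply today"))

# Made-up documents that share "cat" and "mat".
print(shared_terms("The cat sat on the mat",
                   "A cat slept on the mat"))
```

If the first list comes back empty for your two documents, the identity
matrix is the expected answer rather than a bug in the pipeline.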

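> from sklearn.decomposition import TruncatedSVD
> from sklearn.feature_extraction.text import TfidfVectorizer
> from sklearn.metrics.pairwise import cosine_similarity
> from sklearn.preprocessing import Normalizer
>
> def lsa(doc1, doc2):
>     dataset = [doc1, doc2]
>     # TF-IDF weights, with English stop words removed
>     vectorizer = TfidfVectorizer(stop_words='english')
>     X = vectorizer.fit_transform(dataset)
>     # With two documents, X has rank at most 2
>     svd = TruncatedSVD(n_components=100)
>     X = svd.fit_transform(X)
>     X = Normalizer(copy=False).fit_transform(X)
>
>     return cosine_similarity(X)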
