Hello, I have two questions on which I would like your feedback.
The first one:
Here is my code:

    from sklearn.feature_extraction.text import TfidfVectorizer

    documents = [doc1, doc2, doc3]
    tfidf = TfidfVectorizer().fit_transform(documents)
    pairwise_similarity = tfidf * tfidf.T
    print(pairwise_similarity.A)
Here doc1, doc2 and doc3 are plain text documents, and I use tf-idf to find
their similarity. My issue is that I want to use my own script to tokenize the
text, remove stop words and stem the words. So I am looking for a way to use
the above code where doc1, doc2 and doc3 are lists of the already tokenized
text, and where TfidfVectorizer, when called, does not apply any of its own
preprocessing on top of that.
What I tried is to run my script first, join each token list back into a
single string, and feed those strings into the TfidfVectorizer. I am satisfied
with the results, but unfortunately, with 1000 or more documents this is not
the most efficient way to do it. Can you think of anything better?
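To make the workaround concrete, here is a minimal, self-contained sketch of what I am doing now. The function tokenize_document and the three example strings are just placeholders for my own tokenizing/stop-word/stemming script and my real documents:

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Placeholder for my own script (tokenizing, stop-word removal, stemming).
    def tokenize_document(text):
        return [w.lower() for w in text.split()]

    docs = ["The sky is blue",
            "The sun is bright",
            "The sun in the sky is bright"]

    # Current workaround: tokenize each document myself, then join the
    # tokens back into one string so TfidfVectorizer accepts it.
    joined = [" ".join(tokenize_document(d)) for d in docs]

    tfidf = TfidfVectorizer().fit_transform(joined)
    pairwise_similarity = (tfidf @ tfidf.T).toarray()
    print(pairwise_similarity)

This gives the right similarity matrix, but the tokenize-then-join-then-retokenize round trip is the part that feels wasteful at scale.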
The second one:
I want to do the same thing (document similarity) with LSI (Latent Semantic
Indexing). I have used other libraries, and I have also tried to implement it
myself, but I am not fully satisfied. I want the same thing as above: a script
where I import 2-3 documents and get the similarity matrix as output. Nothing
more. Is there any way to do this with scikit-learn?
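To make the desired output concrete, here is a sketch of what I am after, assuming TruncatedSVD can serve as the LSI step (I am not sure this is the recommended approach, and n_components=2 is an arbitrary choice for this toy example):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.metrics.pairwise import cosine_similarity

    docs = ["The sky is blue",
            "The sun is bright",
            "The sun in the sky is bright"]

    tfidf = TfidfVectorizer().fit_transform(docs)

    # Project the tf-idf vectors onto a low-rank "concept" space.
    lsi = TruncatedSVD(n_components=2).fit_transform(tfidf)

    # The desired output: a documents-by-documents similarity matrix.
    similarity_matrix = cosine_similarity(lsi)
    print(similarity_matrix)

Something as short as this, taking the documents in and printing the similarity matrix out, is all I am looking for.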
Best Regards,
Anastasios
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general