Hello, I have two questions on which I would like your feedback.
The first one:
Here is my code:

    from sklearn.feature_extraction.text import TfidfVectorizer

    documents = [doc1, doc2, doc3]
    tfidf = TfidfVectorizer().fit_transform(documents)
    pairwise_similarity = tfidf * tfidf.T
    print(pairwise_similarity.A)
Here doc1, doc2 and doc3 are plain text documents, and I use tf-idf to find
their similarity. My issue is that I want to use my own script to tokenize the
text, remove stop words and stem the words. So I am looking for a way to use
the above code where doc1, doc2 and doc3 are lists of the already tokenized
text, and where TfidfVectorizer, when called, does not apply any of its own
preprocessing on top of that.
What I tried is to run my script first, join each token list back into a
single string, and feed those strings into the TfidfVectorizer. I am satisfied
with the results, but unfortunately, with 1000 or more documents this is not
the most efficient way to do it. Can you think of anything better?
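To make the workaround concrete, here is a minimal, self-contained sketch of what I am doing now. The function tokenize_document and the three example strings are just placeholders for my own tokenizing/stop-word/stemming script and my real documents:

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Placeholder for my own script (tokenizing, stop-word removal, stemming).
    def tokenize_document(text):
        return [w.lower() for w in text.split()]

    docs = ["The sky is blue",
            "The sun is bright",
            "The sun in the sky is bright"]

    # Current workaround: tokenize each document myself, then join the
    # tokens back into one string so TfidfVectorizer accepts it.
    joined = [" ".join(tokenize_document(d)) for d in docs]

    tfidf = TfidfVectorizer().fit_transform(joined)
    pairwise_similarity = (tfidf @ tfidf.T).toarray()
    print(pairwise_similarity)

This gives the right similarity matrix, but the tokenize-then-join-then-retokenize round trip is the part that feels wasteful at scale.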
The second one:
I want to do the same thing (document similarity) with LSI (Latent Semantic
Indexing). I have used other libraries, and I have also tried to implement it
myself, but I am not fully satisfied. I want the same thing as above: a script
where I import 2-3 documents and get the similarity matrix as output. Nothing
more. Is there any way to do this with scikit-learn?
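To make the desired output concrete, here is a sketch of what I am after, assuming TruncatedSVD can serve as the LSI step (I am not sure this is the recommended approach, and n_components=2 is an arbitrary choice for this toy example):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.metrics.pairwise import cosine_similarity

    docs = ["The sky is blue",
            "The sun is bright",
            "The sun in the sky is bright"]

    tfidf = TfidfVectorizer().fit_transform(docs)

    # Project the tf-idf vectors onto a low-rank "concept" space.
    lsi = TruncatedSVD(n_components=2).fit_transform(tfidf)

    # The desired output: a documents-by-documents similarity matrix.
    similarity_matrix = cosine_similarity(lsi)
    print(similarity_matrix)

Something as short as this, taking the documents in and printing the similarity matrix out, is all I am looking for.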
Best Regards,
Anastasios
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general