2013/9/7 Tasos Ventouris <tasosventou...@hotmail.com>:
> Hello, I have two questions on which I would like your feedback.
>
> The first one:
>
> Here is my code:
>
> from sklearn.feature_extraction.text import TfidfVectorizer
>
> documents = [doc1, doc2, doc3]
> tfidf = TfidfVectorizer().fit_transform(documents)
> pairwise_similarity = tfidf * tfidf.T
> print pairwise_similarity.A
>
> Where doc1, doc2 and doc3 are plain text documents. I use tf-idf to find
> their similarity. My issue is that I want to use my own script to tokenize
> the text, remove stop words and stem the words. So, I want to find a way
> to use the above code where doc1, doc2 and doc3 are lists of tokenized
> text, so that when TfidfVectorizer is called, it does not repeat any of
> those steps.
>
> I tried to run my script and then create a string from the list for each
> text and include those texts in the TfidfVectorizer. I am satisfied with
> the results, but unfortunately, if I have 1000 or more documents, this isn't
> the most efficient way to do it. Can you think of anything better?
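For the pre-tokenized input asked about above, one possible approach is to pass a callable as `analyzer` that returns each token list unchanged, so that TfidfVectorizer skips its own tokenization, lowercasing and stop-word handling. This is only a minimal sketch: it assumes a scikit-learn version in which `analyzer` accepts a callable, and the `docs` lists and the `identity` helper are hypothetical stand-ins for the output of your own preprocessing script.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical documents, already tokenized, stop-word filtered and stemmed
# by an external script (placeholder data, not from the original post).
docs = [
    ["cat", "sit", "mat"],
    ["dog", "sit", "mat"],
    ["stock", "market", "fall"],
]


def identity(tokens):
    """Return the token list as-is so the vectorizer does no preprocessing."""
    return tokens


# A callable analyzer replaces the built-in preprocessing and tokenization.
vectorizer = TfidfVectorizer(analyzer=identity)
tfidf = vectorizer.fit_transform(docs)

# Rows are L2-normalized by default, so this product holds cosine similarities.
pairwise_similarity = tfidf.dot(tfidf.T).toarray()
print(pairwise_similarity)
```

With this, there is no need to join each token list back into a string before vectorizing.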
What is the problem with 1000 documents? TfidfVectorizer should be able
to process several million characters (bytes) per second on a single
CPU.

Also, the IDF weights of TF-IDF will be mostly noise on a corpus of only
3 documents.

> The second one:
>
> I want to do the same thing (document similarity) with LSI (Latent Semantic
> Indexing). I have used other libraries, and I also tried to do it myself,
> but I am not fully satisfied. I want the same thing: a script where I will
> import 2-3 documents and get the similarity matrix as output. Nothing more.
> Is there any way to do it with scikit-learn?

You will need more than 3 documents to extract an interesting latent
space. You can compute the LSI components (really just a truncated SVD
on the bag-of-words representation) on a large corpus (e.g. a random
subset of Wikipedia text) and then project your 3 documents into that
space to compute the similarities.

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
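The LSI suggestion above (fit a truncated SVD on a large corpus, then project the documents of interest into that space) could be sketched roughly as follows. The two small lists are placeholder data standing in for a real background corpus and for the 3 documents, and `n_components=2` is chosen only because the toy corpus is tiny; on a real corpus something on the order of a hundred components or more is typical.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder background corpus; in practice this would be a large collection,
# e.g. a random subset of Wikipedia text.
background_corpus = [
    "the cat sat on the mat",
    "a dog chased the cat",
    "dogs and cats are pets",
    "stock markets fell sharply today",
    "investors sold shares as markets fell",
    "the central bank raised interest rates",
]

# Placeholder for the few documents whose similarity is wanted.
new_documents = [
    "my cat likes the dog",
    "the dog and the cat are pets",
    "markets fell after the bank raised rates",
]

# Learn the vocabulary and IDF weights on the background corpus only.
vectorizer = TfidfVectorizer()
X_background = vectorizer.fit_transform(background_corpus)

# Truncated SVD on the TF-IDF matrix gives the LSI components.
svd = TruncatedSVD(n_components=2, random_state=0)
svd.fit(X_background)

# Project the new documents into the latent space and compare them there.
X_new = vectorizer.transform(new_documents)
X_new_lsi = svd.transform(X_new)
similarity = cosine_similarity(X_new_lsi)
print(similarity)
```

Words in the new documents that never occur in the background corpus are ignored by `vectorizer.transform`, which is one more reason to make that corpus large and close in topic to the documents being compared.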