Re: [Scikit-learn-general] TF-IDF and LSI

Olivier Grisel Thu, 26 Sep 2013 08:26:23 -0700

2013/9/7 Tasos Ventouris <[email protected]>:
> Hello, I have two questions on which I would like your feedback.
>
> The first one:
>
> Here is my code:
>
> from sklearn.feature_extraction.text import TfidfVectorizer
>
> documents = [doc1,doc2,doc3]
> tfidf = TfidfVectorizer().fit_transform(documents)
> pairwise_similarity = tfidf * tfidf.T
> print(pairwise_similarity.toarray())
>
> Where doc1, doc2 and doc3 are plain text documents. I use the tf-idf to find
> their similarity. My issue is that I want to use my own script to tokenize
> the text, remove stop words and stem the words. So, I want to find a way
> to use the above code where doc1, doc2 and doc3 are lists of the
> tokenized text, so that when TfidfVectorizer is called, it won't apply any
> of the above steps itself.
>
> I tried to run my script and then create a string from the list for each
> text and include those strings in the TfidfVectorizer. I am satisfied with
> the results, but unfortunately, if I have 1000 or more documents, this isn't
> the most efficient way to do it. Can you think of anything better?


What is the problem with 1000 documents? TfidfVectorizer should be
able to process several million characters (bytes) per second on a
single CPU.

Also, the IDF weights of TF-IDF will be mostly noise on a corpus of
just 3 documents.
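
For the pre-tokenized case, one possible approach (a minimal sketch, not
necessarily the best one) is to pass a callable as the `analyzer`
parameter of TfidfVectorizer, so that each document is taken as a list of
tokens and no further tokenization, lowercasing or stop word removal is
applied. The `identity` helper and the toy `tokenized_docs` below are
made up for illustration; they stand in for the output of your own
script.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def identity(tokens):
    # Hypothetical helper: return the token list unchanged, so the
    # vectorizer skips its own preprocessing and tokenization.
    return tokens


# Toy stand-ins for documents already tokenized, filtered and stemmed
# by an external script.
tokenized_docs = [
    ["cat", "sit", "mat"],
    ["dog", "sit", "mat"],
    ["investor", "sell", "share"],
]

# A callable analyzer receives each raw document (here a list of tokens)
# and must return the sequence of features to count.
vectorizer = TfidfVectorizer(analyzer=identity)
tfidf = vectorizer.fit_transform(tokenized_docs)

# Rows are L2-normalized by default, so the product of the matrix with
# its transpose holds the pairwise cosine similarities.
pairwise_similarity = (tfidf @ tfidf.T).toarray()
print(pairwise_similarity)
```

A named function is used rather than a lambda so that the fitted
vectorizer can still be pickled.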

> The second one:
>
> I want to do the same thing (document similarity) with LSI (Latent Semantic
> Indexing). I have used other libraries, also I tried to do it myself, but I
> am not fully satisfied. I want the same thing. A script where I will import
> 2-3 documents and as an output the similarity matrix. Nothing more. Is there
> any way to do it with Scikit-learn?

You will need more than 3 documents to extract an interesting latent
space. You can compute the LSI components (really just a truncated SVD
on the bag-of-words representation) on a large corpus (e.g. a random
subset of Wikipedia text) and then project your 3 documents into that
space to compute the similarities.
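
A rough sketch of that workflow, assuming TruncatedSVD from
sklearn.decomposition and cosine_similarity from sklearn.metrics.pairwise
are available in your version. The six-sentence `background_corpus` is
only a placeholder for a genuinely large corpus, and `n_components=2` is
chosen to suit the toy data; on real text a value in the hundreds is more
typical.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder for a large background corpus (e.g. a Wikipedia sample).
background_corpus = [
    "the cat sat on the mat and watched the dog",
    "the dog chased the cat across the garden",
    "the dog and the cat can live in the same house",
    "the stock market fell as investors sold shares",
    "investors watched the market and bought more shares",
    "the bank raised interest rates and the market reacted",
]

# The few documents whose pairwise similarity is actually wanted.
documents = [
    "the cat and the dog played in the garden",
    "the dog sat on the mat in the house",
    "investors sold shares as the bank raised rates",
]

# Learn the vocabulary, IDF weights and latent components on the
# background corpus only.
vectorizer = TfidfVectorizer()
svd = TruncatedSVD(n_components=2, random_state=0)
svd.fit(vectorizer.fit_transform(background_corpus))

# Project the target documents into the latent space, then compare them.
projected = svd.transform(vectorizer.transform(documents))
lsi_similarity = cosine_similarity(projected)
print(lsi_similarity)
```

Words in the target documents that never occur in the background corpus
are ignored by vectorizer.transform, which is one more reason to fit on
a corpus that is large and close to your domain.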

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
