Re: [Scikit-learn-general] TF-IDF and LSI

Lars Buitinck Thu, 26 Sep 2013 08:41:16 -0700

2013/9/26 Olivier Grisel <olivier.gri...@ensta.org>:
> 2013/9/7 Tasos Ventouris <tasosventou...@hotmail.com>:
>> I tried to run my script and then create a string from the list for each
>> text and include those texts in the TfidfVectorizer. I am satisfied with
>> the results, but unfortunately, if I have 1000 or more documents, this isn't
>> the most efficient way to do it. Can you think of anything better?
>
> What is the problem with 1000 documents? TfidfVectorizer should be
> able to process several million characters (bytes) per second on a
> single CPU.


Maybe the bottleneck is the matrix multiplication tfidf * tfidf.T, which
doesn't scale well (O(n³) worst case, probably closer to O(n²) for sparse
document-term matrices)? But then 1000 documents is still not a lot.
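
For reference, a minimal sketch of that similarity computation, assuming
a toy list of three strings (the `docs` contents are made up for
illustration). TfidfVectorizer L2-normalizes rows by default, so the
product of the matrix with its transpose gives cosine similarities. The
`@` operator is used here instead of `*` so the code means matrix
multiplication regardless of which sparse type is returned.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy documents; replace with your own texts.
docs = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stock markets fell sharply today",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)  # sparse, rows L2-normalized by default

# Document-document cosine similarity; the product stays sparse until
# it is explicitly densified for inspection.
similarity = (tfidf @ tfidf.T).toarray()
print(similarity.round(2))
```

For large collections you would probably keep the product sparse rather
than calling toarray().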

>> I want to do the same thing (document similarity) with LSI (Latent Semantic
>> Indexing). I have used other libraries, also I tried to do it myself, but I
>> am not fully satisfied. I want the same thing. A script where I will import
>> 2-3 documents and as an output the similarity matrix. Nothing more. Is there
>> any way to do it with Scikit-learn?
>
> You will need more than 3 documents to extract an interesting latent
> space. You can compute the LSI components (really just a truncated SVD
> on the bag of words representation) on a large corpus (e.g. a random
> subset of Wikipedia text) and then project your 3 documents into that
> space to compute the similarities.

That's sklearn.decomposition.TruncatedSVD; see [1].
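
A rough sketch of what Olivier describes, under stated assumptions: the
`background` list below is a tiny made-up stand-in for a large corpus,
`docs` are hypothetical documents to compare, and n_components=2 is
chosen only because the toy corpus is so small (the scikit-learn docs
suggest around 100 components for LSA on real data).

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.pipeline import make_pipeline

# Toy stand-in for a large background corpus (assumption: in practice
# this would be thousands of documents, e.g. a Wikipedia sample).
background = [
    "the cat sat on the mat",
    "a dog chased the cat around the garden",
    "dogs and cats are popular pets",
    "the stock market fell sharply today",
    "investors sold shares as markets dropped",
    "the central bank raised interest rates",
]

# The handful of documents we actually want to compare (hypothetical).
docs = [
    "my cat and my dog are pets",
    "the market dropped and investors sold",
    "the cat chased a dog",
]

# tf-idf followed by truncated SVD is LSI/LSA; fit on the background only.
lsi = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2, random_state=0),
)
lsi.fit(background)

# Project the new documents into the latent space and compare them there.
projected = lsi.transform(docs)
similarity = cosine_similarity(projected)
print(similarity.round(2))
```

Words in `docs` that never occur in `background` are silently ignored by
the fitted vectorizer, which is one more reason to fit on a large corpus.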

Btw., the rule of thumb with LSI is that it tends to be better than
raw tf-idf in the 10k-100k document range, at least for ranking
purposes. That's a range that TfidfVectorizer can handle quite nicely.

[1] http://scikit-learn.org/stable/modules/decomposition.html#truncated-singular-value-decomposition-and-latent-semantic-analysis

