Re: [Scikit-learn-general] CountVectorizer in feature extraction is still slow

Lars Buitinck Thu, 07 Mar 2013 02:18:10 -0800

2013/3/7 Andreas Mueller <[email protected]>:
> On 03/07/2013 09:40 AM, Roman Sinayev wrote:
>> I tried but TfIDF is slow after the vectorization.  The other thing
>> was since it is stateless, wouldn't transformation of a test corpus
>> followed by tfidf result in a totally different matrix?  You won't
>> know which words are responsible for what.
>>
> Yes, it does give different results. But it is way more scalable.
> I think there have been several attempts at speeding up the
> DictVectorizer using multi-processing, iirc without much success.


That was CountVectorizer, not DictVectorizer. But yes, I tried that,
and it got much slower. Feel free to try again, and if multiprocessing
doesn't work, you can even try threads, since the vectorizers may
interleave I/O and computation.
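
To make the threads idea concrete, here is a minimal sketch using only the
standard library. It is not scikit-learn code: count_tokens and count_corpus
are hypothetical helpers, and the regex is only assumed to mirror
CountVectorizer's default token pattern. Keep in mind that under CPython's
GIL, threads are only likely to help where the work overlaps with I/O (for
example, reading documents from disk); pure-Python tokenizing will not run in
parallel.

```python
import re
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Assumption: tokens are runs of two or more word characters.
TOKEN = re.compile(r"(?u)\b\w\w+\b")


def count_tokens(doc):
    """Hypothetical helper: lowercase one document and count its tokens."""
    return Counter(TOKEN.findall(doc.lower()))


def count_corpus(docs, n_threads=4):
    """Count tokens per document in a thread pool, then build a vocabulary."""
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # pool.map preserves the input order of the documents.
        per_doc = list(pool.map(count_tokens, docs))
    vocabulary = sorted(set().union(*per_doc))
    return vocabulary, per_doc


docs = ["the cat sat", "the dog sat on the cat"]
vocabulary, counts = count_corpus(docs)
print(vocabulary)
print(counts[1]["the"])
```

Time this against a plain loop on your own corpus before concluding
anything; on in-memory strings the thread pool may well be slower.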

**BUT**: be sure to profile first to find the weak spots. There are a
few loops in the vectorizer that might be better handled in Cython
than in pure Python.
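
A rough sketch of what "profile first" could look like, using cProfile from
the standard library. The count_terms function below is a toy stand-in that
I made up for illustration, not the actual CountVectorizer code; to profile
the real thing you would wrap a fit_transform call on your own corpus in the
same enable()/disable() pair.

```python
import cProfile
import io
import pstats
import re
from collections import Counter

# Assumption: tokens are runs of two or more word characters.
TOKEN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(doc):
    """Toy tokenizer standing in for the vectorizer's analyzer."""
    return TOKEN.findall(doc.lower())


def count_terms(docs):
    """Toy counting loop: map each token to a column index and count it."""
    vocabulary = {}
    rows = []
    for doc in docs:
        row = Counter()
        for tok in tokenize(doc):
            # New tokens get the next free column index.
            idx = vocabulary.setdefault(tok, len(vocabulary))
            row[idx] += 1
        rows.append(row)
    return vocabulary, rows


docs = ["the quick brown fox jumps over the lazy dog"] * 2000

profiler = cProfile.Profile()
profiler.enable()
vocabulary, rows = count_terms(docs)
profiler.disable()

# Collect the ten most expensive entries by cumulative time.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(10)
print(report.getvalue())
```

The functions that dominate the cumulative column are the candidates worth
considering for a Cython rewrite.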

(CountVectorizer actually got 10% slower when we pulled a patch that
reduces its memory usage.)


-- 
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam

