Re: [Scikit-learn-general] CountVectorizer in feature extraction is still slow

Roman Sinayev Thu, 07 Mar 2013 00:41:28 -0800

I tried it, but the TF-IDF step is slow after the vectorization.  My
other concern: since HashingVectorizer is stateless, wouldn't
transforming a test corpus and then applying TF-IDF result in a
totally different matrix?  You also won't know which words are
responsible for what.
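
A minimal sketch of how the test-corpus question could be checked,
assuming a recent scikit-learn release (the alternate_sign parameter
does not exist in older versions).  The toy corpora, the n_features
value, and the variable names are made up for illustration.

```python
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer

# Toy corpora, invented for this illustration.
train_docs = ["the cat sat on the mat", "the dog chased the cat"]
test_docs = ["a cat and a dog"]

# Stateless hashing: no fit step and no vocabulary is stored.
# norm=None keeps raw counts; alternate_sign=False keeps them non-negative.
hv = HashingVectorizer(n_features=2**10, norm=None, alternate_sign=False)
X_train = hv.transform(train_docs)
X_test = hv.transform(test_docs)

# The IDF weights are the only learned state: fit on train, reuse on test.
tfidf = TfidfTransformer()
tfidf.fit(X_train)
T_test = tfidf.transform(X_test)

# The column for a token depends only on the hash function, so "cat"
# lands in the same column whichever corpus it comes from.
cat_col = hv.transform(["cat"]).indices[0]
print(X_train.shape, X_test.shape, cat_col)
```

If this behaves as I expect, the columns line up between train and
test, so the matrices are comparable.  The second problem remains,
though: the hash cannot be inverted, so going from a column back to a
word would need a separate lookup built by hand.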


>On 03/07/2013 09:13 AM, Roman Sinayev wrote:
>> This module is a crucial bottleneck in NLP problems. I am trying to
>> refactor it and also make it parallel across documents with Python's
>> multiprocessing module.  Is anyone else working on this?
>If this is your bottleneck, you should consider using HashingVectorizer:
>http://scikit-learn.org/dev/modules/feature_extraction.html#vectorizing-a-large-text-corpus-with-the-hashing-trick

