2013/3/7 Andreas Mueller <[email protected]>:
> On 03/07/2013 09:40 AM, Roman Sinayev wrote:
>> I tried but TfIDF is slow after the vectorization. The other thing
>> was since it is stateless, wouldn't transformation of a test corpus
>> followed by tfidf result in a totally different matrix? You won't
>> know which words are responsible for what.
>>
> Yes, it does give different results. But it is way more scalable.
> I think there have been several attempts at speeding up the
> DictVectorizer using
> multi-processing, iirc without much success.
CountVectorizer. But yes, I tried that, and it got much slower.

Feel free to try again, and if multiprocessing doesn't work, you can
even try threads, since the vectorizers may interleave I/O and
computation. **BUT**: be sure to profile first to find the weak spots.
There are a few loops in the vectorizer that might be better handled in
Cython than pure Python. (CountVectorizer actually got 10% slower when
we pulled a patch that reduces its memory usage.)

-- 
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
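
A minimal sketch of the "profile first" advice above, using only the
standard library's cProfile and pstats. The names `count_tokens` and
`profile_call` are hypothetical helpers made up for illustration, and
`count_tokens` is a toy pure-Python stand-in for a vectorizer, not
scikit-learn's implementation.

```python
import cProfile
import io
import pstats
import re
from collections import Counter

# Tokens of two or more word characters (an assumption for this toy example).
TOKEN_RE = re.compile(r"\b\w\w+\b")


def count_tokens(docs):
    """Hypothetical stand-in for a vectorizer: one Counter per document."""
    return [Counter(TOKEN_RE.findall(doc.lower())) for doc in docs]


def profile_call(func, *args, top=10):
    """Run func(*args) under cProfile and return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    # Sort by cumulative time so the most expensive call paths come first.
    stats.sort_stats("cumulative").print_stats(top)
    return result, buf.getvalue()


docs = ["the quick brown fox", "the lazy dog and the fox"] * 1000
counts, report = profile_call(count_tokens, docs)
print(report)
```

With scikit-learn installed, the same helper should work on the real
thing, e.g. `profile_call(CountVectorizer().fit_transform, docs)`. The
entries with the largest cumulative time are the loops worth looking at
before reaching for multiprocessing, threads, or Cython.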
