I tried it, but TF-IDF is slow after the vectorization. The other thing: since HashingVectorizer is stateless, wouldn't transforming a test corpus and then applying TF-IDF result in a totally different matrix? You also won't know which words are responsible for which features.
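To make the statelessness question concrete, here is a rough pure-Python sketch of the hashing trick followed by an IDF weighting. It is not scikit-learn's implementation (HashingVectorizer uses MurmurHash3, not MD5); `column_of`, `hash_vectorize`, `fit_idf`, `apply_idf`, and `N_FEATURES` are all made-up names for illustration. What it suggests is that the token-to-column mapping depends only on the hash function, so train and test rows should share the same columns, while the IDF weights are the one part that has to be fitted on the training corpus. It also shows why the words cannot be recovered: only the column index is kept.

```python
import hashlib
import math
from collections import Counter

N_FEATURES = 2 ** 10  # illustrative size, not a recommendation


def column_of(token):
    """Map a token to a column index using only a hash (no fitted vocabulary)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "little") % N_FEATURES


def hash_vectorize(doc):
    """Return a sparse row {column: term count}; the tokens themselves are discarded."""
    return Counter(column_of(tok) for tok in doc.lower().split())


def fit_idf(rows):
    """Learn smoothed IDF weights per column from the training rows (the stateful step)."""
    n_docs = len(rows)
    doc_freq = Counter(col for row in rows for col in row)
    idf = {col: math.log((1 + n_docs) / (1 + df)) + 1.0 for col, df in doc_freq.items()}
    default_idf = math.log(1 + n_docs) + 1.0  # weight for columns unseen in training
    return idf, default_idf


def apply_idf(row, idf, default_idf):
    """Reweight one sparse row with previously fitted IDF weights."""
    return {col: tf * idf.get(col, default_idf) for col, tf in row.items()}


train_docs = ["the cat sat on the mat", "the dog chased the cat"]
test_docs = ["a cat and a dog"]

train_rows = [hash_vectorize(d) for d in train_docs]
test_rows = [hash_vectorize(d) for d in test_docs]  # no fitting needed here

idf, default_idf = fit_idf(train_rows)  # fitted on the training corpus only
test_weighted = [apply_idf(r, idf, default_idf) for r in test_rows]

cat_col = column_of("cat")
print(cat_col in train_rows[0], cat_col in test_rows[0])
```

If this reasoning carries over to scikit-learn, the equivalent would be calling `HashingVectorizer.transform` on both corpora and fitting `TfidfTransformer` on the training matrix only. The interpretability concern still stands, since the hash is one-way.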
>On 03/07/2013 09:13 AM, Roman Sinayev wrote:
>> This module is a crucial bottleneck in NLP problems. I am trying to
>> refactor it and also make it parallel across documents with python
>> multiprocessing module. Is anyone else working on this?
>If this is your bottleneck, you should consider using HashingVectorizer:
>http://scikit-learn.org/dev/modules/feature_extraction.html#vectorizing-a-large-text-corpus-with-the-hashing-trick

_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
