So far I have gotten CountVectorizer about 2x faster without
multiprocessing. However, I have a couple of questions.
1. Why do we not apply max_df, min_df and max_features when a custom
vocabulary is provided?
Some people may provide a huge vocabulary but still not be
interested in some of its words, e.g. the very frequent ones.
2. I am not sure why sorting is necessary. In my implementation I'm
storing features on a first-encountered basis. The code comments
say:
"""
# store map from term name to feature integer index: we sort the
# terms to have reproducible outcome for the vocabulary structure:
# otherwise the mapping from feature name to indices might depend
# on the memory layout of the machine. Furthermore sorted terms
# might make it possible to perform binary search in the feature
# names array.
"""
So if we store features in the order first encountered, the mapping
wouldn't depend on the memory layout. It also avoids unnecessary
matrix reorderings, which makes it about 10% faster overall. Why not
make sorting an option in the API, defaulting to False?
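
To make the comparison concrete, here is a toy sketch of the two
index assignments. It is plain Python, not the actual CountVectorizer
code: the function names and the whitespace tokenization are
assumptions made only for illustration.

```python
def build_vocab_first_seen(docs):
    """Map each token to an index in the order it is first encountered."""
    vocab = {}
    for doc in docs:
        for tok in doc.split():  # assumption: naive whitespace tokenizer
            # setdefault only assigns the next free index to unseen tokens
            vocab.setdefault(tok, len(vocab))
    return vocab


def build_vocab_sorted(docs):
    """Map each token to its rank in the sorted list of distinct tokens."""
    terms = sorted({tok for doc in docs for tok in doc.split()})
    return {term: idx for idx, term in enumerate(terms)}


docs = ["the cat sat", "the dog sat down"]
first_seen = build_vocab_first_seen(docs)
sorted_vocab = build_vocab_sorted(docs)
```

Both mappings are reproducible for a given corpus. The difference is
that the first-encountered indices change if the documents arrive in
a different order, while the sorted indices do not.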
On Thu, Mar 7, 2013 at 5:17 AM, Lars Buitinck <[email protected]> wrote:
> 2013/3/7 Andreas Mueller <[email protected]>:
>> On 03/07/2013 09:40 AM, Roman Sinayev wrote:
>>> I tried but TfIDF is slow after the vectorization. The other thing
>>> was since it is stateless, wouldn't transformation of a test corpus
>>> followed by tfidf result in a totally different matrix? You won't
>>> know which words are responsible for what.
>>>
>> Yes, it does give different results. But it is way more scalable.
>> I think there have been several attempts at speeding up the
>> DictVectorizer using
>> multi-processing, iirc without much success.
>
> CountVectorizer. But yes, I tried that, and it got much slower. Feel
> free to try again, and if multiprocessing doesn't work, you can even
> try threads, since the vectorizers may interleave I/O and computation.
>
> **BUT**: be sure to profile first to find the weak spots. There's a
> few loops in the vectorizer that might be better handled in Cython
> than pure Python.
>
> (CountVectorizer actually got 10% slower when we pulled a patch that
> reduces its memory usage.)
>
>
> --
> Lars Buitinck
> Scientific programmer, ILPS
> University of Amsterdam
>
> ------------------------------------------------------------------------------
> Symantec Endpoint Protection 12 positioned as A LEADER in The Forrester
> Wave(TM): Endpoint Security, Q1 2013 and "remains a good choice" in the
> endpoint security space. For insight on selecting the right partner to
> tackle endpoint security challenges, access the full report.
> http://p.sf.net/sfu/symantec-dev2dev
> _______________________________________________
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general