Re: [Scikit-learn-general] CountVectorizer in feature extraction is still slow

Roman Sinayev Mon, 11 Mar 2013 13:12:42 -0700

So far I have made CountVectorizer about 2x faster without multiprocessing,
but I have a couple of questions.


1. Why are max_df, min_df and max_features ignored when a custom
vocabulary is provided?
Someone may supply a huge vocabulary and still want to drop some of its
words, for example those that turn out to be very frequent in the corpus.
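
To make question 1 concrete, here is a minimal pure-Python sketch of
document-frequency pruning applied to a user-supplied vocabulary. It is
not scikit-learn code: prune_vocabulary is a hypothetical helper, the
whitespace tokenizer is a stand-in, and for simplicity min_df is treated
as an absolute document count and max_df as a proportion of documents.

```python
from collections import Counter


def prune_vocabulary(docs, vocabulary, min_df=1, max_df=1.0):
    """Hypothetical helper: filter a fixed vocabulary by document frequency.

    Assumptions for this sketch: min_df is an absolute document count,
    max_df is a proportion of the corpus, tokens are split on whitespace.
    """
    vocab = set(vocabulary)
    df = Counter()
    for doc in docs:
        # Count each vocabulary term at most once per document.
        df.update(set(doc.split()) & vocab)
    max_count = max_df * len(docs)
    # Preserve the order in which the caller listed the terms.
    return [term for term in vocabulary if min_df <= df[term] <= max_count]


docs = ["the cat sat", "the dog sat", "the bird flew"]
vocabulary = ["the", "cat", "dog", "sat", "fish"]
kept = prune_vocabulary(docs, vocabulary, min_df=1, max_df=0.7)
print(kept)
```

Here "the" is dropped because it appears in every document, and "fish"
is dropped because it never appears at all.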

2. I am not sure why sorting is necessary.  In my implementation I
store features in first-encountered order. The comment in the code
says:
            """
            # store map from term name to feature integer index: we sort the
            # terms to have reproducible outcome for the vocabulary structure:
            # otherwise the mapping from feature name to indices might depend
            # on the memory layout of the machine. Furthermore sorted terms
            # might make it possible to perform binary search in the feature
            # names array.
            """
If we store terms in the order they are first encountered, the mapping
doesn't depend on the memory layout either.  It also avoids unnecessary
matrix reorderings, which makes things about 10% faster overall.  Why not
make sorting an option in the API, defaulting to False?
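
The two orderings can be compared with a small pure-Python sketch. This
is not the CountVectorizer implementation: build_vocabulary and its
sort_terms flag are hypothetical names, and tokens are simply split on
whitespace.

```python
def build_vocabulary(docs, sort_terms=False):
    """Hypothetical helper: map each term to an integer feature index."""
    vocab = {}
    for doc in docs:
        for term in doc.split():
            if term not in vocab:
                # First-encountered order: the next free index.
                vocab[term] = len(vocab)
    if sort_terms:
        # Sorted order: reassign indices alphabetically.
        vocab = {term: i for i, term in enumerate(sorted(vocab))}
    return vocab


docs = ["zebra apple", "mango apple"]
first_seen = build_vocabulary(docs)
sorted_vocab = build_vocabulary(docs, sort_terms=True)
print(first_seen)
print(sorted_vocab)
```

Both mappings are reproducible for a fixed corpus. The first-encountered
one changes if the documents are presented in a different order, while
the sorted one does not.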

On Thu, Mar 7, 2013 at 5:17 AM, Lars Buitinck <[email protected]> wrote:
> 2013/3/7 Andreas Mueller <[email protected]>:
>> On 03/07/2013 09:40 AM, Roman Sinayev wrote:
>>> I tried but TfIDF is slow after the vectorization.  The other thing
>>> was since it is stateless, wouldn't transformation of a test corpus
>>> followed by tfidf result in a totally different matrix?  You won't
>>> know which words are responsible for what.
>>>
>> Yes, it does give different results. But it is way more scalable.
>> I think there have been several attempts at speeding up the
>> DictVectorizer using
>> multi-processing, iirc without much success.
>
> CountVectorizer. But yes, I tried that, and it got much slower. Feel
> free to try again, and if multiprocessing doesn't work, you can even
> try threads, since the vectorizers may interleave I/O and computation.
>
> **BUT**: be sure to profile first to find the weak spots. There's a
> few loops in the vectorizer that might be better handled in Cython
> than pure Python.
>
> (CountVectorizer actually got 10% slower when we pulled a patch that
> reduces its memory usage.)
>
>
> --
> Lars Buitinck
> Scientific programmer, ILPS
> University of Amsterdam
>
> ------------------------------------------------------------------------------
> Symantec Endpoint Protection 12 positioned as A LEADER in The Forrester
> Wave(TM): Endpoint Security, Q1 2013 and "remains a good choice" in the
> endpoint security space. For insight on selecting the right partner to
> tackle endpoint security challenges, access the full report.
> http://p.sf.net/sfu/symantec-dev2dev
> _______________________________________________
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

