Re: [Scikit-learn-general] CountVectorizer in feature extraction is still slow

Olivier Grisel Mon, 11 Mar 2013 14:20:39 -0700

2013/3/11 Lars Buitinck <[email protected]>:
> 2013/3/11 Olivier Grisel <[email protected]>:
>> 2013/3/11 Roman Sinayev <[email protected]>:
>>> I got CountVectorizer about 2x faster without multiprocessing so far,
>>> however I have a couple of questions.
>
> I'm curious how you pulled that off.
>
>>> 1. Why do we not use max_df and min_df and max_features when custom
>>> vocabulary is provided?
>>
>> It seems weird to me to mutate the dictionary of the caller. I would
>> not expect that.
>
> Agree.
>
>>>             """
>>>             # store map from term name to feature integer index: we sort the
>>>             # terms to have reproducible outcome for the vocabulary structure:
>>>             # otherwise the mapping from feature name to indices might depend
>>>             # on the memory layout of the machine. Furthermore sorted terms
>>>             # might make it possible to perform binary search in the feature
>>>             # names array.
>>>            """
>>> So if we store in the order first encountered, it wouldn't depend on
>>> the memory layout.  It also avoids unnecessary matrix reorderings, so
>>> about 10% faster overall.
>
> I never really understood this comment. Memory layout? What memory layout?


The internal state of the Python dict used to store the vocabulary.
Maybe "memory layout" is not the right term: it has more to do with the
randomization of the seed used by the hash function of the Python dict.
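
To make the two options concrete, here is a minimal sketch. It is not the
actual CountVectorizer code: vocab_first_seen and vocab_sorted are
hypothetical helper names, and whitespace splitting stands in for the real
tokenizer. The first function assigns indices in order of first appearance,
so the result depends only on the order of the input documents. The second
sorts the distinct terms, so the result is independent of both document
order and dict/set hashing.

```python
def vocab_first_seen(docs):
    """Map each term to an index in order of first appearance."""
    vocab = {}
    for doc in docs:
        for term in doc.split():
            # setdefault only assigns the next free index to unseen terms
            vocab.setdefault(term, len(vocab))
    return vocab


def vocab_sorted(docs):
    """Map each term to its rank among the sorted distinct terms."""
    # The set's iteration order may vary with hash randomization,
    # but sorting removes that dependence.
    terms = sorted({term for doc in docs for term in doc.split()})
    return {term: i for i, term in enumerate(terms)}


docs = ["the cat sat", "the dog sat down"]
first_seen = vocab_first_seen(docs)
by_sort = vocab_sorted(docs)
```

With either approach the indices no longer depend on how the dict happens
to iterate. The difference is that the sorted mapping also stays the same
if the documents arrive in a different order, while the first-seen mapping
does not.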

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
