2013/3/11 Olivier Grisel <[email protected]>:
> 2013/3/11 Roman Sinayev <[email protected]>:
>> I got CountVectorizer about 2x faster without multiprocessing so far,
>> however I have a couple of questions.

I'm curious how you pulled that off.

>> 1. Why do we not use max_df and min_df and max_features when custom
>> vocabulary is provided?
>
> It seems weird to me to mutate the dictionary of the caller. I would
> not expect that.

Agree.

>>             """
>>             # store map from term name to feature integer index: we sort the
>>             # terms to have reproducible outcome for the vocabulary structure:
>>             # otherwise the mapping from feature name to indices might depend
>>             # on the memory layout of the machine. Furthermore sorted terms
>>             # might make it possible to perform binary search in the feature
>>             # names array.
>>            """
>> So if we store in the order first encountered, it wouldn't depend on
>> the memory layout.  It also avoids unnecessary matrix reorderings, so
>> about 10% faster overall.

I never really understood this comment. Memory layout? What memory layout?
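
To make the two orderings concrete, here is a minimal sketch. It is not
the actual CountVectorizer code: build_vocabulary and its sort_terms flag
are hypothetical names, and tokenization is reduced to a whitespace split.

```python
def build_vocabulary(docs, sort_terms=False):
    """Map each whitespace-separated term to an integer column index.

    Hypothetical helper for illustration only.
    """
    vocabulary = {}
    for doc in docs:
        for term in doc.split():
            # A new term gets the next free index, so indices follow the
            # order in which terms are first encountered.
            vocabulary.setdefault(term, len(vocabulary))
    if sort_terms:
        # Reassign indices alphabetically; an already-built matrix would
        # need its columns permuted to match this new mapping.
        vocabulary = dict((term, i)
                          for i, term in enumerate(sorted(vocabulary)))
    return vocabulary


docs = ["the cat sat", "the dog sat down"]
first_seen = build_vocabulary(docs)
alphabetical = build_vocabulary(docs, sort_terms=True)
```

In both cases the index is stored as the dict value, so the mapping itself
does not rely on the iteration order of the dict.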

> Indeed but then you need to store the feature names in an ordered dict
> instead of a regular dict.
> AFAIK this is not available in Python 2.6 (which we still support). We
> would have to backport an implementation in sklearn.utils.fixes for
> python 2.6 compat.
> Also ordered dict needs more memory than a dict.

I suspect OrderedDict will be pretty expensive since it implements a
linked list in pure Python (but I didn't measure).
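
Something like the following could be used to check the memory side. It is
only a rough sketch: container_sizes is a hypothetical helper, and
sys.getsizeof reports what the container itself claims, not the keys and
values (which are shared by both variants anyway). The numbers will depend
on the Python version and build.

```python
import sys
from collections import OrderedDict


def container_sizes(n_terms=100000):
    """Return (dict_bytes, ordered_dict_bytes) for the same term -> index map.

    Hypothetical helper for illustration only.
    """
    terms = ["term%d" % i for i in range(n_terms)]
    plain = dict((t, i) for i, t in enumerate(terms))
    ordered = OrderedDict((t, i) for i, t in enumerate(terms))
    return sys.getsizeof(plain), sys.getsizeof(ordered)


plain_bytes, ordered_bytes = container_sizes(1000)
print("dict: %d bytes, OrderedDict: %d bytes" % (plain_bytes, ordered_bytes))
```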

>>  Why not have sorting as an option in the
>> API with default False?
>
> That sounds reasonable.

+0.

-- 
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam
