2013/3/11 Roman Sinayev <[email protected]>:
> I got CountVectorizer about 2x faster without multiprocessing so far,
> however I have a couple of questions.
>
> 1. Why do we not use max_df and min_df and max_features when custom
> vocabulary is provided?
> Some people may provide a huge vocabulary, but they wouldn't be
> interested in some words if they're very frequent etc.

Applying those filters to a user-supplied vocabulary would mean mutating
the caller's dictionary, which seems weird to me. I would not expect that.
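
If such filtering were wanted anyway, one way to avoid the surprise would be
to build a new, re-indexed dict and leave the caller's object alone. A minimal
sketch, assuming absolute document-count thresholds; prune_vocabulary and the
doc_freq mapping are hypothetical names, not scikit-learn API:

```python
def prune_vocabulary(vocabulary, doc_freq, min_df=1, max_df=None):
    """Return a new, re-indexed vocabulary; the caller's dict is not modified.

    vocabulary: dict mapping term -> column index (user supplied)
    doc_freq:   dict mapping term -> number of documents containing the term
    min_df, max_df: absolute document-count thresholds (simplified assumption)
    """
    # Walk the terms in their original column order so relative order is kept.
    ordered_terms = sorted(vocabulary, key=vocabulary.get)
    kept = [
        term
        for term in ordered_terms
        if doc_freq.get(term, 0) >= min_df
        and (max_df is None or doc_freq.get(term, 0) <= max_df)
    ]
    # Assign fresh contiguous indices to the surviving terms.
    return dict((term, index) for index, term in enumerate(kept))


user_vocab = {"the": 0, "cat": 1, "sat": 2}
doc_freq = {"the": 100, "cat": 7, "sat": 3}
pruned = prune_vocabulary(user_vocab, doc_freq, min_df=5, max_df=50)
```

Here user_vocab still has its three entries afterwards, and pruned only
contains the terms within the thresholds.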

> 2. I am not sure why sorting is necessary.  In my implementation I'm
> storing features on the first-encountered basis. In the code comments
> it says
>             """
>             # store map from term name to feature integer index: we sort the
>             # terms to have reproducible outcome for the vocabulary structure:
>             # otherwise the mapping from feature name to indices might depend
>             # on the memory layout of the machine. Furthermore sorted terms
>             # might make it possible to perform binary search in the feature
>             # names array.
>            """
> So if we store in the order first encountered, it wouldn't depend on
> the memory layout.  It also avoids unnecessary matrix reorderings, so
> about 10% faster overall.

Indeed, but then you need to store the feature names in an ordered dict
instead of a regular dict.
AFAIK this is not available in Python 2.6 (which we still support). We
would have to backport an implementation in sklearn.utils.fixes for
Python 2.6 compatibility.

Also, an ordered dict needs more memory than a regular dict.
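
To make the trade-off concrete, here is a rough sketch of first-seen indexing
next to sorted indexing. It is not the actual CountVectorizer code:
build_vocabulary and its sort flag are hypothetical, and it assumes
collections.OrderedDict is available (Python 2.7 and later):

```python
from collections import OrderedDict


def build_vocabulary(tokenized_docs, sort=False):
    """Map each term to an integer column index.

    With sort=False, indices follow the order in which terms are first seen.
    With sort=True, indices follow the lexicographic order of the terms.
    """
    vocabulary = OrderedDict()
    for tokens in tokenized_docs:
        for token in tokens:
            if token not in vocabulary:
                # The next free index is the current vocabulary size.
                vocabulary[token] = len(vocabulary)
    if sort:
        # Re-index by sorted term; a real matrix would also need its
        # columns permuted to match, which is the reordering cost.
        vocabulary = OrderedDict(
            (term, index) for index, term in enumerate(sorted(vocabulary))
        )
    return vocabulary


docs = [["pear", "apple"], ["apple", "fig"]]
first_seen = build_vocabulary(docs)
sorted_vocab = build_vocabulary(docs, sort=True)
```

Both variants give the same mapping on every machine for the same input; the
difference is whether the feature order is alphabetical or tied to the order
of the documents.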

>  Why not have sorting as an option in the
> API with default False?

That sounds reasonable.

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general