2013/3/11 Olivier Grisel <[email protected]>:
> 2013/3/11 Roman Sinayev <[email protected]>:
>> I got CountVectorizer about 2x faster without multiprocessing so far,
>> however I have a couple of questions.

I'm curious how you pulled that off.

>> 1. Why do we not use max_df and min_df and max_features when custom
>> vocabulary is provided?
>
> It seems weird to me to mutate the dictionary of the caller. I would
> not expect that.

Agree.

>> """
>>         # store map from term name to feature integer index: we sort the
>>         # terms to have reproducible outcome for the vocabulary structure:
>>         # otherwise the mapping from feature name to indices might depend
>>         # on the memory layout of the machine. Furthermore sorted terms
>>         # might make it possible to perform binary search in the feature
>>         # names array.
>> """
>> So if we store in the order first encountered, it wouldn't depend on
>> the memory layout. It also avoids unnecessary matrix reorderings, so
>> about 10% faster overall.

I never really understood this comment. Memory layout? What memory layout?

> Indeed but then you need to store the feature names in an ordered dict
> instead of a regular dict.
> AFAIK this is not available in Python 2.6 (which we still support). We
> would have to backport an implementation in sklearn.utils.fixes for
> python 2.6 compat.
> Also ordered dict needs more memory than a dict.

I suspect OrderedDict will be pretty expensive since it implements a
linked list in pure Python (but I didn't measure).

>> Why not have sorting as an option in the
>> API with default False?
>
> That sounds reasonable.

+0.

-- 
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
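
For readers following the thread, here is a minimal sketch of the two vocabulary orderings being compared: indices assigned in order of first appearance versus indices assigned after sorting the terms. It is only an illustration under stated assumptions, not the CountVectorizer code. The function names, the whitespace tokenization, and the toy documents are all hypothetical.

```python
def build_vocab_first_seen(docs):
    """Map each term to an index in order of first appearance (hypothetical helper)."""
    vocab = {}
    for doc in docs:
        for term in doc.split():  # naive whitespace tokenization, for illustration only
            # setdefault assigns a new index only the first time a term is seen
            vocab.setdefault(term, len(vocab))
    return vocab


def build_vocab_sorted(docs):
    """Map each term to its rank among the sorted distinct terms (hypothetical helper)."""
    terms = sorted({term for doc in docs for term in doc.split()})
    return {term: index for index, term in enumerate(terms)}


docs = ["the cat sat", "the dog sat down"]
first_seen = build_vocab_first_seen(docs)
sorted_vocab = build_vocab_sorted(docs)

# Feature names in column order can be recovered from the stored indices.
first_seen_names = sorted(first_seen, key=first_seen.get)
sorted_names = sorted(sorted_vocab, key=sorted_vocab.get)
```

In this sketch both mappings depend only on the input documents. The first-seen variant avoids the sort, but its column order changes if the documents arrive in a different order. Whether this would carry over to the real implementation, or yield the reported speedup there, is not something the sketch shows.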
