2013/3/11 Lars Buitinck <[email protected]>:
> 2013/3/11 Olivier Grisel <[email protected]>:
>> 2013/3/11 Roman Sinayev <[email protected]>:
>>> I got CountVectorizer about 2x faster without multiprocessing so far,
>>> however I have a couple of questions.
>
> I'm curious how you pulled that off.
>
>>> 1. Why do we not use max_df and min_df and max_features when custom
>>> vocabulary is provided?
>>
>> It seems weird to me to mutate the dictionary of the caller. I would
>> not expect that.
>
> Agree.
>
>>> """
>>> # store map from term name to feature integer index: we sort the
>>> # terms to have reproducible outcome for the vocabulary structure:
>>> # otherwise the mapping from feature name to indices might depend
>>> # on the memory layout of the machine. Furthermore sorted terms
>>> # might make it possible to perform binary search in the feature
>>> # names array.
>>> """
>>> So if we store in the order first encountered, it wouldn't depend on
>>> the memory layout. It also avoids unnecessary matrix reorderings, so
>>> about 10% faster overall.
>
> I never really understood this comment. Memory layout? What memory layout?

The internal state of the python dict used to store the vocabulary.
Maybe this is not the right word, and it has more to do with the
randomization of the seed used by the hash function of the python dict.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
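
To make the ordering point concrete, here is a minimal sketch in plain Python. It is not the actual CountVectorizer code, and the function names are hypothetical. It compares three ways of assigning feature indices to terms: in set iteration order, which can change between interpreter runs when string hash randomization is enabled; in sorted order; and in order of first occurrence, as Roman suggests.

```python
def build_vocabulary_hash_order(documents):
    """Assign indices in set iteration order (hypothetical helper).

    The iteration order of a set of strings depends on their hash values,
    so the resulting mapping may differ between interpreter runs when
    string hash randomization is active.
    """
    terms = set()
    for doc in documents:
        terms.update(doc.split())
    return {term: idx for idx, term in enumerate(terms)}


def build_vocabulary_sorted(documents):
    """Assign indices in sorted term order: reproducible across runs."""
    terms = set()
    for doc in documents:
        terms.update(doc.split())
    return {term: idx for idx, term in enumerate(sorted(terms))}


def build_vocabulary_first_seen(documents):
    """Assign indices in order of first occurrence: also reproducible,
    and it needs no sort and no later reordering of columns."""
    vocabulary = {}
    for doc in documents:
        for term in doc.split():
            # setdefault only assigns a new index the first time a term appears
            vocabulary.setdefault(term, len(vocabulary))
    return vocabulary


docs = ["the cat sat", "the dog sat down"]
hash_order_vocab = build_vocabulary_hash_order(docs)
sorted_vocab = build_vocabulary_sorted(docs)
first_seen_vocab = build_vocabulary_first_seen(docs)
```

Only the first variant depends on the hash function. The other two give the same mapping on every run for the same input, and the sorted one additionally allows binary search over the feature names.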
