This is weird. Are you sure it is not the other way around? As far as I know, the default for min_df was changed from 2 to 1, which should give you a larger vocabulary in the git version, not a smaller one.
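
To illustrate why lowering min_df should grow the vocabulary, here is a minimal pure-Python sketch of document-frequency filtering. It is a simplified stand-in, not scikit-learn's implementation: the helper `vocabulary_size`, the toy corpus, and the plain `\b\w\w+\b` tokenizer are all assumptions made for the example, and stop words, max_df and tf-idf weighting are left out.

```python
import re
from collections import Counter


def vocabulary_size(docs, min_df=1, ngram_range=(1, 2)):
    """Count distinct n-grams that occur in at least `min_df` documents.

    Hypothetical helper for illustration only; it mimics the idea behind
    min_df, not the actual vectorizer code.
    """
    lo, hi = ngram_range
    doc_freq = Counter()
    for doc in docs:
        # Tokens of two or more word characters, lowercased.
        tokens = re.findall(r"\b\w\w+\b", doc.lower())
        ngrams = set()
        for n in range(lo, hi + 1):
            for i in range(len(tokens) - n + 1):
                ngrams.add(" ".join(tokens[i:i + n]))
        # A set is used so each n-gram counts once per document.
        doc_freq.update(ngrams)
    return sum(1 for df in doc_freq.values() if df >= min_df)


# Toy corpus (made up for this example).
docs = [
    "the quick brown fox",
    "the quick red fox",
    "a lazy brown dog",
]

size_min_df_1 = vocabulary_size(docs, min_df=1)
size_min_df_2 = vocabulary_size(docs, min_df=2)
print(size_min_df_1, size_min_df_2)
```

With min_df=1 every n-gram is kept, while min_df=2 drops everything that appears in only one document, so the second count can never be larger than the first. On the real data, comparing len(vectorizer.vocabulary_) after fitting with an explicit min_df=1 and min_df=2 on both versions should show whether the default is what differs.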

On 03/14/2013 04:11 AM, Ark wrote:
> The vectorized input with the same training data set differs with versions
> 0.13.1 and 0.14-git.
>
> For:
> vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1,2),
>     smooth_idf=True, sublinear_tf=True, max_df=0.5,
>     token_pattern=ur'\b(?!\d)\w\w+\b')
>
> On fit_transform the shape of the input data
> - with version 0.13.1 is (12440, 1270712)
> - with version 0.14-git is (12440, 484762)
>
> I do not change code and run the same on two different machines in parallel,
> apart from the number of features the size of the classifier goes from 8.4 to
> 26G, but I guess that is due to the number of features. Does this seem
> correct?
>
> _______________________________________________
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
