2013/3/14 Ark <[email protected]>:
> For:
>
>     vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1, 2),
>                                  smooth_idf=True, sublinear_tf=True,
>                                  max_df=0.5,
>                                  token_pattern=ur'\b(?!\d)\w\w+\b')
>
> the shape of the matrix returned by fit_transform is
> - (12440, 1270712) with version 0.13.1
> - (12440, 484762) with version 0.14-git
>
> I do not change the code; I run the same script on two different machines in
> parallel. Apart from the number of features, the size of the classifier goes
> from 8.4 to 26 GB, but I guess that is due to the number of features. Does
> this seem correct?
This is unexpected. Can you inspect the vocabulary_ on both vectorizers?
Try computing their set.intersection, set.difference and
set.symmetric_difference (all methods of the built-in set type).

-- 
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam
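For concreteness, a minimal sketch of that comparison; the names vec_old
(fitted under 0.13.1) and vec_new (fitted under 0.14-git) are hypothetical,
and both vectorizers are assumed to have been fit on the same corpus:

    # vocabulary_ maps each extracted term to its column index;
    # the keys are all we need to compare the two feature sets.
    old_vocab = set(vec_old.vocabulary_)
    new_vocab = set(vec_new.vocabulary_)

    common = old_vocab & new_vocab      # set.intersection
    dropped = old_vocab - new_vocab     # set.difference: terms gone in 0.14-git
    differing = old_vocab ^ new_vocab   # set.symmetric_difference

    print("common: %d, dropped: %d, differing: %d"
          % (len(common), len(dropped), len(differing)))

    # Eyeballing a sample of the differing terms usually shows whether
    # tokenization, ngram extraction or max_df pruning changed:
    print(sorted(differing)[:20])

If the dropped terms all follow one pattern (say, mostly bigrams, or mostly
high-document-frequency terms), that points at which part of the extraction
changed between the two versions.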
