2012/10/22 Ark <[email protected]>:
> e if it's related to IDF normalization?
>>
>> How many dimensions do you have in your fitted model?
>>
>>   >>> print len(vectorizer.vocabulary_)
>>
>> How many documents do you have in your training corpus?
>>
>> How many non-zeros do you have in your transformed document?
>>
>>   >>> print vectorizer.transform([my_text_document])
>
> In [30]: print vectorizer.transform([input_txt]).data.shape
> (110,)
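For reference, a minimal self-contained sketch of the checks asked for above. The thread never says which vectorizer class is in use; the mention of IDF normalization suggests `TfidfVectorizer`, so that choice and the toy corpus below are assumptions:

>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> docs = ["the quick brown fox", "jumps over the lazy dog"]  # toy stand-in corpus
>>> vectorizer = TfidfVectorizer().fit(docs)
>>> print len(vectorizer.vocabulary_)  # number of dimensions in the fitted model
>>> X = vectorizer.transform(["the quick dog"])
>>> print X.nnz                        # non-zeros in the transformed document
>>> print X.data.shape                 # same count, as reported in the thread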
You just have 110 words in a 50kB email document? This is weird. Maybe there is something special in the formatting of your data that makes one element of the vectorizer run unusually slow. It's hard to tell without having access to a dataset that exhibits this performance issue.

Try to have a manual look at the content of the 20 newsgroups corpus:

>>> from sklearn.datasets import fetch_20newsgroups
>>> twenty = fetch_20newsgroups()
>>> print twenty.data[0]
>>> print twenty.data[1]
...

And see if the text of the documents of the twenty newsgroups data, which is fast to vectorize, is very different from the content of your documents that are slow to vectorize (see the timing sketch below).

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
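A rough way to make that speed comparison concrete (a sketch: `vectorizer` is the fitted vectorizer from this thread, while `my_documents` is a hypothetical list holding the slow documents; it is not a name from the thread):

>>> import time
>>> from sklearn.datasets import fetch_20newsgroups
>>> twenty = fetch_20newsgroups()
>>> tic = time.time()
>>> X = vectorizer.transform(twenty.data[:100])   # the corpus that vectorizes quickly
>>> print "20 newsgroups: %0.3fs" % (time.time() - tic)
>>> tic = time.time()
>>> X = vectorizer.transform(my_documents[:100])  # hypothetical list of the slow documents
>>> print "my documents:  %0.3fs" % (time.time() - tic)

If the second call is much slower per document, that would point at something in the content of those documents rather than in the vectorizer configuration.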
