> Can you try to turn off IDF normalization using `use_idf=False` in
> the constructor params of your vectorizer and retry (fit + predict) to
> see if it's related to IDF normalization?
>
> How many dimensions do you have in your fitted model?
https://gist.github.com/3933727

data_vectors.shape = (10361, 402061)

> You just have 110 words in a 50kB email document? This is weird. Maybe
> there is something special in the formatting of your data that makes
> one element of the vectorizer run unusually slow. It's hard to tell
> without having access to a dataset that exhibits this performance
> issue.
>
> Try to have a manual look at the content of the 20 newsgroups corpus:
>
> >>> from sklearn.datasets import fetch_20newsgroups
> >>> twenty = fetch_20newsgroups()
> >>> print(twenty.data[0])
> >>> print(twenty.data[1])
> ...
>
> And see if the text of the documents of the twenty newsgroups data,
> which is fast to vectorize, is very different from the content of your
> documents that are slow to vectorize.

I separate the email parts ('text/plain', 'text/html' or 'text/pdf')
before vectorizing the email, and that part is what gets vectorized; the
training process does the same. I don't think the text is very different
in structure, only of course in content.

Turning off IDF gave some speedup, but vectorization + predict still
seems to take time.
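
To narrow down where the time goes, it may help to time the two steps separately. A minimal sketch, assuming the vectorizer and classifier are already fitted and expose `transform` and `predict` (as scikit-learn estimators do); `timed` and `profile_prediction` are hypothetical helper names, not part of any library:

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def profile_prediction(vectorizer, classifier, documents):
    """Time transform and predict separately.

    Assumes both objects are already fitted; nothing is refit here.
    """
    vectors, t_transform = timed(vectorizer.transform, documents)
    labels, t_predict = timed(classifier.predict, vectors)
    return labels, {"transform": t_transform, "predict": t_predict}
```

Calling `profile_prediction(vectorizer, classifier, [email_text])` on one slow email should show whether the time is spent in `transform` or in `predict`, which would tell us which side to look at next.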
