>Can you try to turn off IDF normalization using `use_idf=False` in
>the constructor params of your vectorizer and retry (fit + predict) to
>see if it's related to IDF normalization?
>How many dimensions do you have in your fitted model?

https://gist.github.com/3933727
data_vectors.shape = (10361, 402061)
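
For reference, a minimal sketch of the kind of check that produces that shape, with IDF turned off as suggested. It assumes a `TfidfVectorizer`; the three toy documents are stand-ins for illustration, not my real email data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in corpus (hypothetical); the real one has 10361 documents.
docs = [
    "first toy email body",
    "second toy email body with more words",
    "something else entirely",
]

# use_idf=False disables the IDF reweighting; term-frequency
# weighting and row normalization still apply.
vectorizer = TfidfVectorizer(use_idf=False)
data_vectors = vectorizer.fit_transform(docs)

# (number of documents, number of distinct terms in the vocabulary)
print(data_vectors.shape)
```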


> You just have 110 words in a 50kB email document? This is weird. Maybe
> there is something special in the formatting of your data that makes
> one element of the vectorizer run unusually slow. It's hard to tell
> without having access to a dataset that exhibits this performance
> issue.
> 
> Try to have a manual look at the content of the 20 newsgroups corpus:
> 
> >>> from sklearn.datasets import fetch_20newsgroups
> >>> twenty = fetch_20newsgroups()
> >>> print(twenty.data[0])
> >>> print(twenty.data[1])
> ...
> 
> And see if the text of the documents in the twenty newsgroups data,
> which is fast to vectorize, is very different from the content of your
> documents that are slow to vectorize.
> 
I separate the email parts ('text/plain', 'text/html' or 'text/pdf') before
vectorizing the email, and that part is what gets vectorized, the same as in
the training process. I don't think the text is very different in structure,
only of course in content. Turning off IDF gave some speedup, but
vectorization + predict still seems to take a long time.
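
To narrow down which of the two steps dominates, each one could be timed separately. Below is a rough sketch of such a measurement. The `timed` helper and the two `fake_*` functions are hypothetical stand-ins so that the snippet runs on its own; with the real objects one would pass `vectorizer.transform` and the classifier's `predict` instead:

```python
import time


def timed(label, fn, *args, **kwargs):
    """Call fn(*args, **kwargs), print the elapsed time, return (result, seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print("%s: %.4f s" % (label, elapsed))
    return result, elapsed


# Hypothetical stand-in for vectorizer.transform: lowercases and splits.
def fake_transform(docs):
    return [doc.lower().split() for doc in docs]


# Hypothetical stand-in for clf.predict: one dummy label per row.
def fake_predict(rows):
    return [len(tokens) % 2 for tokens in rows]


docs = ["Hello world", "Some other email body text"]

# Time each stage on its own instead of the combined call.
X, t_vec = timed("transform", fake_transform, docs)
y, t_pred = timed("predict", fake_predict, X)
```

If the transform step turns out to account for most of the time, the slowdown is in the vectorizer rather than in the classifier.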

   

