> >> 7s is very long. How long is your text document in bytes?
>
> The text documents are around 50kB.
>
> That should not take 7s to extract a TF-IDF for a single 50kB
> document. There must be a bug. Can you please put a minimalistic code
> snippet + example document that reproduces the issue on a gist?
> http://gist.github.com
The document is an email, and I am using the HTML part (text/html) of the email for classification. The document is about 45 kB.

After training the vectorizer I dump it to a joblib file, and a separate Python script uses joblib.load to load the classifier and vectorizer and classify the email. Before classifying I have to transform the email into a TF-IDF vector, i.e. run vectorizer.transform(input_txt). I have measured the time taken by that call with both time() and cProfile, which show similar results. The time taken by the actual classifier.predict call is trivial, so the bottleneck is vectorization, as I mentioned in an earlier post.

Unfortunately, I cannot include the email I use for training/testing (it is non-public), but I will try to put together minimal code that reproduces the issue. In the meantime, I am including the cProfile results:

https://gist.github.com/3815467

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
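For reference, the workflow described above can be sketched roughly as below. This is a minimal stand-in, not the original code: the toy documents, file name, and choice of SGDClassifier are illustrative assumptions, and the real emails are ~45 kB HTML documents.

```python
# Sketch of the described workflow: fit and persist the vectorizer and
# classifier with joblib, reload them in a separate step, then time
# vectorizer.transform versus classifier.predict.
import os
import tempfile
import time

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier

# --- "training" script: fit and dump (toy data, not the real emails) ---
docs = ["spam spam cheap offer now", "meeting agenda attached", "buy cheap now"]
labels = [1, 0, 1]
vectorizer = TfidfVectorizer()
clf = SGDClassifier().fit(vectorizer.fit_transform(docs), labels)

path = os.path.join(tempfile.mkdtemp(), "model.joblib")
joblib.dump((vectorizer, clf), path)

# --- "classification" script: load and predict, timing each step ---
vectorizer, clf = joblib.load(path)
input_txt = "cheap offer now"

t0 = time.time()
vec = vectorizer.transform([input_txt])  # reported bottleneck
t1 = time.time()
pred = clf.predict(vec)                  # reported as trivial
t2 = time.time()

print("transform: %.4fs  predict: %.4fs" % (t1 - t0, t2 - t1))
```

Note that transform expects an iterable of documents, so a single email should be passed as `[input_txt]`; passing the raw string would be iterated character by character.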
