> >> 7s is very long. How long is your text document in bytes?
>
> The text documents are around 50kB.
>
> That should not take 7s to extract a TF-IDF for a single 50kB
> document. There must be a bug. Can you please put a minimalistic code
> snippet + example document that reproduces the issue on a gist?
> http://gist.github.com
The document is an email, and I am using the HTML part (text/html) of the email for classification. The document is about 45 kB.

After training the vectorizer I dump it to a joblib file, and a separate Python script uses joblib.load to load the classifier and vectorizer and classify the email. Before classifying I have to transform the email into a TF-IDF vector, i.e. run vectorizer.transform(input_txt). I have measured the time taken by that call with both time() and cProfile, which show similar results. The time taken by the actual classifier.predict call is trivial, so the bottleneck is vectorization, as I mentioned in an earlier post.

Unfortunately, I cannot include the email I use for training/testing (it is non-public), but I will try to put together minimal code that reproduces the issue. In the meantime, I am including the cProfile results:

https://gist.github.com/3815467

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
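For reference, the workflow described above can be sketched roughly as below. This is a minimal stand-in, not the original code: the toy documents, file name, and choice of SGDClassifier are illustrative assumptions, and the real emails are ~45 kB HTML documents.

```python
# Sketch of the described workflow: fit and persist the vectorizer and
# classifier with joblib, reload them in a separate step, then time
# vectorizer.transform versus classifier.predict.
import os
import tempfile
import time

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier

# --- "training" script: fit and dump (toy data, not the real emails) ---
docs = ["spam spam cheap offer now", "meeting agenda attached", "buy cheap now"]
labels = [1, 0, 1]
vectorizer = TfidfVectorizer()
clf = SGDClassifier().fit(vectorizer.fit_transform(docs), labels)

path = os.path.join(tempfile.mkdtemp(), "model.joblib")
joblib.dump((vectorizer, clf), path)

# --- "classification" script: load and predict, timing each step ---
vectorizer, clf = joblib.load(path)
input_txt = "cheap offer now"

t0 = time.time()
vec = vectorizer.transform([input_txt])  # reported bottleneck
t1 = time.time()
pred = clf.predict(vec)                  # reported as trivial
t2 = time.time()

print("transform: %.4fs  predict: %.4fs" % (t1 - t0, t2 - t1))
```

Note that transform expects an iterable of documents, so a single email should be passed as `[input_txt]`; passing the raw string would be iterated character by character.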
