Try dividing the email in half and seeing whether one half takes much
more than 50% of the time.

Repeat until you have a sample that you can share :)
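
A rough sketch of that bisection, in case it helps. The function names
(time_call, shrink_slow_input) and the min_len/dominance thresholds are made
up for illustration and are not part of scikit-learn; the timing source is
passed in as `measure` so you can plug in whatever call is slow for you.

```python
import time


def time_call(process, chunk, repeats=3):
    """Return the best wall-clock time (seconds) of process(chunk) over a few runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        process(chunk)
        best = min(best, time.perf_counter() - start)
    return best


def shrink_slow_input(text, measure, min_len=200, dominance=0.6):
    """Halve `text` repeatedly, keeping whichever half dominates the cost.

    `measure(chunk)` must return a non-negative cost (e.g. seconds).
    `min_len` and `dominance` are arbitrary illustrative defaults.
    """
    while len(text) > min_len:
        mid = len(text) // 2
        left, right = text[:mid], text[mid:]
        t_left, t_right = measure(left), measure(right)
        total = t_left + t_right
        if total <= 0:
            break  # nothing measurable left to compare
        if t_left / total > dominance:
            text = left
        elif t_right / total > dominance:
            text = right
        else:
            break  # neither half dominates; the cost is spread out
    return text
```

With the vectorizer from this thread, `measure` could be something like
`lambda chunk: time_call(lambda c: vectorizer.transform([c]), chunk)`. Note
the list around the chunk: transform() expects an iterable of documents, not
a bare string. Cutting HTML in the middle can split tags, so the halves will
not tokenize exactly like the whole email, and if the loop stops right away
the time is probably a fixed overhead rather than something in the text.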

On Mon, Oct 1, 2012 at 8:44 PM, Ark <[email protected]> wrote:
>
>> >> 7s is very long. How long is your text document in bytes ?
>> > The text documents are around 50kB.
>>
>> It should not take 7s to extract TF-IDF features for a single 50 kB
>> document. There must be a bug; can you please put a minimal code
>> snippet + example document that reproduces the issue on a gist?
>> http://gist.github.com
>
> [The document is an email and I am using the HTML portion (text/html) of the
>  email for classification. The size of the document is about 45K.
>  After training the vectorizer I dump it to a joblib file, and have a separate
>  Python script that uses the classifier and vectorizer to classify the email,
>  so that script calls joblib.load to load the classifier and vectorizer.
>  Before classifying I have to transform the email into tf-idf vectors,
>  hence I run vectorizer.transform(input_txt) on the email. I have measured the
>  time taken by the call using time() and cProfile, which show similar results.
>  The time taken by the actual classifier.predict is trivial, so the bottleneck
>  is vectorization, as I mentioned in an earlier post. Unfortunately, I cannot
>  include the email I use for training/test [since it is non-public], but I will
>  try to include minimal code that reproduces the issue. In the meantime I am
>  including the cProfile run results.]
>
> https://gist.github.com/3815467
>
>
>
>
> ------------------------------------------------------------------------------
> Don't let slow site performance ruin your business. Deploy New Relic APM
> Deploy New Relic app performance management and know exactly
> what is happening inside your Ruby, Python, PHP, Java, and .NET app
> Try New Relic at no cost today and get our sweet Data Nerd shirt too!
> http://p.sf.net/sfu/newrelic-dev2dev
> _______________________________________________
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general



-- 
Joseph Turian, Ph.D. | President, MetaOptimize
"Optimize Profits. Optimize Engagement."
http://metaoptimize.com
855-ALL-DATA

The web's most active forum for data scientists: http://metaoptimize.com/qa/
