2012/10/2 Ark <[email protected]>:
>
>> >> 7s is very long. How large is your text document in bytes?
>> > The text documents are around 50kB.
>>
>> Extracting TF-IDF features for a single 50kB document should not take
>> 7s. There must be a bug; can you please put a minimal code snippet +
>> example document that reproduces the issue in a gist?
>> http://gist.github.com
>
> [The document is an email, and I am using the HTML portion [text/html]
> of the email for classification. The size of the document is about 45K.
> After training the vectorizer I dump it to a joblib file, and a separate
> python script uses the classifier and vectorizer to classify the email,
> so that script calls joblib.load to load the classifier and vectorizer.
> Before classifying I have to transform the email into tf-idf vectors,
> hence the call to vectorizer.transform(input_txt). I have measured the
> time taken by that call using time() and cProfile, which show similar
> results. The time taken by the actual classifier.predict is trivial, so
> the bottleneck is the vectorization, as I mentioned in an earlier post.
> Unfortunately, I cannot include the email I use for training/testing
> [since it is non-public], but I will try to include minimal code that
> reproduces the issue. In the meantime I am including the cProfile
> results.]
>
> https://gist.github.com/3815467
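Thanks. If I understand your description correctly, the predict-side
script boils down to something like this sketch (filenames and variable
names are placeholders, not your actual code):

    import time
    from sklearn.externals import joblib

    # load the fitted models dumped after training
    vectorizer = joblib.load('vectorizer.joblib')
    classifier = joblib.load('classifier.joblib')

    input_txt = open('email.html').read()  # the ~45kB html payload

    t0 = time.time()
    X = vectorizer.transform([input_txt])  # the reported ~7s bottleneck
    print 'transform: %.3fs' % (time.time() - t0)

    t0 = time.time()
    print classifier.predict(X)
    print 'predict:   %.3fs' % (time.time() - t0)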
In your cProfile output the offending line seems to be:

    1    1.193    1.193    7.473    7.473 base.py:529(setdiag)

I don't understand how that could happen at predict time. At fit
time the culprit could be this line:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/text.py#L648
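For context, the IDF weighting there amounts to building a sparse
diagonal matrix of IDF weights and multiplying the term-frequency
matrix by it, roughly like this toy sketch (not the exact library
code):

    import numpy as np
    import scipy.sparse as sp

    # toy idf weights for 4 features
    idf = np.log(np.array([4., 2., 2., 1.])) + 1.0
    idf_diag = sp.spdiags(idf, 0, 4, 4)  # diagonal idf matrix, cheap to build

    # term-frequency matrix for 2 documents
    X = sp.csr_matrix([[1., 0., 2., 0.],
                       [0., 3., 0., 1.]])
    X_tfidf = X * idf_diag  # scales column j by idf[j]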
Which versions of numpy / scipy / scikit-learn are you using?
Can you try turning off IDF weighting with `use_idf=False` in the
constructor params of your vectorizer and retrying (fit + predict) to
see whether the slowdown is related to the IDF normalization?
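Something like this (assuming you use TfidfVectorizer; `train_documents`
stands for your training corpus):

    from sklearn.feature_extraction.text import TfidfVectorizer

    # same params as your current vectorizer, just without idf weighting
    vectorizer = TfidfVectorizer(use_idf=False)
    vectorizer.fit(train_documents)
    X = vectorizer.transform([input_txt])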
How many dimensions do you have in your fitted model?
>>> print len(vectorizer.vocabulary_)
How many documents do you have in your training corpus?
How many non-zeros do you have in your transformed document?
>>> print vectorizer.transform([my_text_document])
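Printing the sparse matrix lists its non-zero entries; `.nnz` gives the
count directly:

>>> X = vectorizer.transform([my_text_document])
>>> print X.nnz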
scikit-learn 0.12 recently introduced a new `min_df` parameter to
control the model size and prune features that are likely too noisy.
Can you try setting it to `min_df=5`, for instance, and see if that
fixes your issue?
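Again a sketch, with `train_documents` standing in for your corpus:

    from sklearn.feature_extraction.text import TfidfVectorizer

    vectorizer = TfidfVectorizer(min_df=5)  # drop terms in < 5 documents
    vectorizer.fit(train_documents)
    print len(vectorizer.vocabulary_)  # the vocabulary should shrink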
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel