2012/10/13 Ark <[email protected]>:
> Olivier Grisel <olivier.grisel@...> writes:
>
>
>> > https://gist.github.com/3815467
>>
>> The offending line seems to be:
>>
>>         1    1.193    1.193    7.473    7.473 base.py:529(setdiag)
>>
>> and I don't understand how that could happen at predict time. At fit
>> time it could have been:
>>
>> https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/text.py#L648
>>
>> Which versions of numpy / scipy / scikit-learn are you using?
>>
> -----------> I am using scikit-learn 0.12
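
(This is also why I asked about versions: `base.py:529(setdiag)` is
scipy's generic spmatrix.setdiag, which on older scipy releases assigns
the diagonal element by element and can be slow with 200k+ features. A
rough illustration of the pattern, not the exact library code; the size
and weights below are placeholders:)

import numpy as np
import scipy.sparse as sp

n_features = 200424
idf = np.ones(n_features)  # placeholder IDF weights

# setdiag fills the diagonal entry by entry and can be very slow:
d = sp.lil_matrix((n_features, n_features))
d.setdiag(idf)

# building the diagonal directly with spdiags is typically much faster:
d = sp.spdiags(idf, diags=0, m=n_features, n=n_features)
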
>
>> Can you try to turn off IDF normalization using `use_idf=False ` in
>> the constructor params of your vectorizer and retry (fit + predict) to
>> see if it's related to IDF normalization?
>>
> I turned IDF normalization off, but the time taken is still large. I will
> profile the vectorizer's transform and fit_transform and send the results
> if you want to see them.
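
If it helps, a minimal way to profile just the transform step (a
sketch: `vectorizer` is assumed to be your fitted TfidfVectorizer and
`docs` a list of raw text documents):

import cProfile

# profile only the transform call, sorted by cumulative time
cProfile.run('X = vectorizer.transform(docs)', sort='cumulative')
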
>
> ---------------------------------------------------------------
>> How many dimensions do you have in your fitted model?
>>
>> >>> print len(vectorizer.vocabulary_)
>>
>> How many documents do you have in your training corpus?
>>
>> How many non-zeros do you have in your transformed document?
>>
>> >>> print vectorizer.transform([my_text_document])
>>
>
>> Recently in scikit-learn 0.12 a new `min_df` parameter has been
>> introduced to control the model size and remove features that are
>> likely too noisy. Can you try to set it to `min_df=5` for instance and
>> see if it fixes your issue?
>>
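
Concretely, something along these lines (a sketch; `raw_documents`
stands in for your training texts):

from sklearn.feature_extraction.text import TfidfVectorizer

# prune features seen in fewer than 5 documents and skip IDF weighting
vectorizer = TfidfVectorizer(min_df=5, use_idf=False)
X = vectorizer.fit_transform(raw_documents)
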
> -------------------------------------------------------------------
> Here is the output from my last run during training:
>
> Dataset dir: ../dataset/
>
> Done generating filenames and the target array.
> Dumping label names... done
> Extracting features from dataset using TfidfVectorizer(analyzer=word,
> binary=False, charset=utf-8,
>         charset_error=strict, dtype=<type 'long'>, input=content,
>         lowercase=True, max_df=0.5, max_features=None, max_n=None,
>         min_df=5, min_n=None, ngram_range=(1, 2), norm=l2,
>         preprocessor=None, smooth_idf=True, stop_words=english,
>         strip_accents=None, sublinear_tf=True,
>         token_pattern=\b(?!\d)\w\w+\b, tokenizer=None, use_idf=True,
>         vocabulary=None).
> [TRAIN] Reading 12213 emails.
> ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▒ 100% 
> 12213/12213
> Feature extraction of training data done in 559.408841848 seconds
> Number of samples in training data: 12213
>  Number of features: 200424
>
> Dumping vectorizer... done.
> Dumping data vectors... done.
> Dumping target vectors... done.
> 2606 labels to train on.
> Training OneVsRestClassifier(estimator=LinearSVC(C=10, class_weight=None,
> dual=True, fit_intercept=True,
>      intercept_scaling=1, loss=l2, multi_class=ovr, penalty=l2, tol=0.0001,
>      verbose=0),
>           estimator__C=10, estimator__class_weight=None,
>           estimator__dual=True, estimator__fit_intercept=True,
>           estimator__intercept_scaling=1, estimator__loss=l2,
>           estimator__multi_class=ovr, estimator__penalty=l2,
>           estimator__tol=0.0001, estimator__verbose=0)
> done [1786.571s]
>
> Training time:  1786.57059312
> Input Data: (12213, 200424)

I don't see the number of non-zeros: could you please do:

>>> print vectorizer.transform([my_text_document])

as I asked previously? The run time should be linear in the number
of non-zeros.
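
For example (assuming `vectorizer` is your fitted vectorizer and
`my_text_document` a single raw document):

>>> X = vectorizer.transform([my_text_document])
>>> print X.nnz  # number of stored non-zeros in the sparse matrix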

For reference, on my machine I have the following timing:

In [5]: from sklearn.datasets import fetch_20newsgroups

In [6]: from sklearn.feature_extraction.text import CountVectorizer

In [7]: twenty = fetch_20newsgroups()

In [8]: %time X = CountVectorizer().fit_transform(twenty.data)
CPU times: user 12.14 s, sys: 0.66 s, total: 12.80 s
Wall time: 13.12 s

In [9]: X
Out[9]:
<11314x56436 sparse matrix of type '<type 'numpy.int64'>'
        with 1713894 stored elements in COOrdinate format>
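
If you want to compare the per-document transform cost in the same
session (a sketch; the timing will of course depend on your machine):

In [10]: vec = CountVectorizer().fit(twenty.data)

In [11]: %timeit vec.transform(twenty.data[:1])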

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
