Olivier Grisel <olivier.grisel@...> writes:
> > https://gist.github.com/3815467
>
> The offending line seems to be:
>
>     1    1.193    1.193    7.473    7.473 base.py:529(setdiag)
>
> which I don't understand how it could happen at predict time. At fit
> time it could have been:
>
> https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/text.py#L648
>
> Which versions of numpy / scipy / scikit-learn are you using?

I am using scikit-learn 0.12.

> Can you try to turn off IDF normalization using `use_idf=False` in
> the constructor params of your vectorizer and retry (fit + predict) to
> see if it's related to IDF normalization?

I turned IDF normalization off, but the time taken is still large. I
will profile vectorizer.transform and fit_transform and send the
results if you want to see them.

> How many dimensions do you have in your fitted model?
>
>     >>> print len(vectorizer.vocabulary_)
>
> How many documents do you have in your training corpus?
>
> How many non-zeros do you have in your transformed document?
>
>     >>> print vectorizer.transform([my_text_document])
>
> Recently in scikit-learn 0.12 a new `min_df` parameter has been
> introduced to control the model size and remove features that are
> likely too noisy. Can you try to set it to `min_df=5` for instance and
> see if it fixes your issue?

Here is the output from my last run during training:

    Dataset dir: ../dataset/
    Done generating filenames and the target array.
    Dumping label names... done
    Extracting features from dataset using TfidfVectorizer(analyzer=word,
        binary=False, charset=utf-8, charset_error=strict,
        dtype=<type 'long'>, input=content, lowercase=True, max_df=0.5,
        max_features=None, max_n=None, min_df=5, min_n=None,
        ngram_range=(1, 2), norm=l2, preprocessor=None, smooth_idf=True,
        stop_words=english, strip_accents=None, sublinear_tf=True,
        token_pattern=\b(?!\d)\w\w+\b, tokenizer=None, use_idf=True,
        vocabulary=None).
    [TRAIN] Reading 12213 emails.
    [progress bar] 100% 12213/12213
    Feature extraction of training data done in 559.408841848 seconds
    Number of samples in training data: 12213
    Number of features: 200424
    Dumping vectorizer... done.
    Dumping data vectors... done.
    Dumping target vectors... done.
    2606 labels to train on.
    Training OneVsRestClassifier(estimator=LinearSVC(C=10,
        class_weight=None, dual=True, fit_intercept=True,
        intercept_scaling=1, loss=l2, multi_class=ovr, penalty=l2,
        tol=0.0001, verbose=0), estimator__C=10,
        estimator__class_weight=None, estimator__dual=True,
        estimator__fit_intercept=True, estimator__intercept_scaling=1,
        estimator__loss=l2, estimator__multi_class=ovr,
        estimator__penalty=l2, estimator__tol=0.0001,
        estimator__verbose=0)
    done [1786.571s]
    Training time: 1786.57059312
    Input Data: (12213, 200424)

I train the classifier and the vectorizer, then dump both with
joblib.dump. When I need to predict, I load the vectorizer and the
classifier, vectorize the email with vectorizer.transform, and then
predict the category. By the way, the feature extraction time above
includes the time taken to read the emails from disk, since I am using
a generator to avoid holding all of them in memory at once.
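In case it helps to see the whole pipeline, here is a minimal,
self-contained sketch of the fit / dump / load / predict workflow I
described above. The toy corpus, the stand-in labels and the .pkl file
names are placeholders, not my actual data or paths:

    # Minimal sketch of the fit -> dump -> load -> predict workflow.
    # Toy corpus and .pkl paths stand in for the real emails/files.
    from sklearn.externals import joblib
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.svm import LinearSVC

    # Toy stand-in for the ~12k emails really read from disk
    # through a generator.
    train_texts = ["free money offer", "meeting at noon",
                   "cheap money fast", "project meeting schedule"]
    y_train = [0, 1, 0, 1]  # stand-in categories

    # --- training ---
    vectorizer = TfidfVectorizer()
    X_train = vectorizer.fit_transform(train_texts)
    clf = OneVsRestClassifier(LinearSVC(C=10))
    clf.fit(X_train, y_train)

    # Persist both fitted objects so prediction can run later in a
    # separate process with the exact same vocabulary.
    joblib.dump(vectorizer, 'vectorizer.pkl')
    joblib.dump(clf, 'classifier.pkl')

    # --- prediction (typically in another process) ---
    vectorizer = joblib.load('vectorizer.pkl')
    clf = joblib.load('classifier.pkl')
    X_new = vectorizer.transform(["money meeting now"])
    print clf.predict(X_new)

Dumping the vectorizer together with the classifier matters: the
predict-time transform must reuse the fitted vocabulary, otherwise the
feature indices would not line up with what the classifier was trained
on.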
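To answer the diagnostic questions above, I believe these checks on a
fitted vectorizer are enough (`email_text` is just any raw email text,
here a placeholder):

    # Size of the fitted model and sparsity of one transformed document.
    email_text = "money meeting now"           # any raw email text
    print len(vectorizer.vocabulary_)           # number of features
    X_doc = vectorizer.transform([email_text])  # 1 x n_features sparse row
    print X_doc.nnz                             # non-zeros in this document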
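And this is roughly how I am profiling the transform call to get
output like the gist linked above, just plain cProfile sorted by
cumulative time:

    # Profile a single predict-time transform call.
    import cProfile
    cProfile.run('vectorizer.transform([email_text])', sort='cumulative')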
