Olivier Grisel <olivier.grisel@...> writes:
> > https://gist.github.com/3815467
>
> The offending line seems to be:
>
>     1    1.193    1.193    7.473    7.473 base.py:529(setdiag)
>
> which I don't understand how it could happen at predict time. At fit
> time it could have been:
>
> https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/text.py#L648
>
> Which versions of numpy / scipy / scikit-learn are you using?

I am using scikit-learn 0.12.

> Can you try to turn off IDF normalization using `use_idf=False` in
> the constructor params of your vectorizer and retry (fit + predict) to
> see if it's related to IDF normalization?

I turned IDF normalization off, but the time taken is still large. I
will profile vectorizer.transform and fit_transform and send the
results if you want to see them.

> How many dimensions do you have in your fitted model?
>
>     >>> print len(vectorizer.vocabulary_)
>
> How many documents do you have in your training corpus?
>
> How many non-zeros do you have in your transformed document?
>
>     >>> print vectorizer.transform([my_text_document])
>
> Recently in scikit-learn 0.12 a new `min_df` parameter has been
> introduced to control the model size and remove features that are
> likely too noisy. Can you try to set it to `min_df=5` for instance and
> see if it fixes your issue?

Here is the output from my last run during training:

    Dataset dir: ../dataset/
    Done generating filenames and the target array.
    Dumping label names... done
    Extracting features from dataset using TfidfVectorizer(analyzer=word,
        binary=False, charset=utf-8, charset_error=strict,
        dtype=<type 'long'>, input=content, lowercase=True, max_df=0.5,
        max_features=None, max_n=None, min_df=5, min_n=None,
        ngram_range=(1, 2), norm=l2, preprocessor=None, smooth_idf=True,
        stop_words=english, strip_accents=None, sublinear_tf=True,
        token_pattern=\b(?!\d)\w\w+\b, tokenizer=None, use_idf=True,
        vocabulary=None).
    [TRAIN] Reading 12213 emails.
    [progress bar] 100% 12213/12213
    Feature extraction of training data done in 559.408841848 seconds
    Number of samples in training data: 12213
    Number of features: 200424
    Dumping vectorizer... done.
    Dumping data vectors... done.
    Dumping target vectors... done.
    2606 labels to train on.
    Training OneVsRestClassifier(estimator=LinearSVC(C=10,
        class_weight=None, dual=True, fit_intercept=True,
        intercept_scaling=1, loss=l2, multi_class=ovr, penalty=l2,
        tol=0.0001, verbose=0), estimator__C=10,
        estimator__class_weight=None, estimator__dual=True,
        estimator__fit_intercept=True, estimator__intercept_scaling=1,
        estimator__loss=l2, estimator__multi_class=ovr,
        estimator__penalty=l2, estimator__tol=0.0001,
        estimator__verbose=0)
    done [1786.571s]
    Training time: 1786.57059312
    Input Data: (12213, 200424)

I train the classifier and the vectorizer, then dump both with
joblib.dump. When I need to predict, I load the vectorizer and the
classifier, vectorize the email with vectorizer.transform, and then
predict the category. By the way, the feature extraction time above
includes the time taken to read the emails from disk, since I am using
a generator to avoid holding all of them in memory at once.
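In case it helps to see the whole pipeline, here is a minimal,
self-contained sketch of the fit / dump / load / predict workflow I
described above. The toy corpus, the stand-in labels and the .pkl file
names are placeholders, not my actual data or paths:

    # Minimal sketch of the fit -> dump -> load -> predict workflow.
    # Toy corpus and .pkl paths stand in for the real emails/files.
    from sklearn.externals import joblib
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.svm import LinearSVC

    # Toy stand-in for the ~12k emails really read from disk
    # through a generator.
    train_texts = ["free money offer", "meeting at noon",
                   "cheap money fast", "project meeting schedule"]
    y_train = [0, 1, 0, 1]  # stand-in categories

    # --- training ---
    vectorizer = TfidfVectorizer()
    X_train = vectorizer.fit_transform(train_texts)
    clf = OneVsRestClassifier(LinearSVC(C=10))
    clf.fit(X_train, y_train)

    # Persist both fitted objects so prediction can run later in a
    # separate process with the exact same vocabulary.
    joblib.dump(vectorizer, 'vectorizer.pkl')
    joblib.dump(clf, 'classifier.pkl')

    # --- prediction (typically in another process) ---
    vectorizer = joblib.load('vectorizer.pkl')
    clf = joblib.load('classifier.pkl')
    X_new = vectorizer.transform(["money meeting now"])
    print clf.predict(X_new)

Dumping the vectorizer together with the classifier matters: the
predict-time transform must reuse the fitted vocabulary, otherwise the
feature indices would not line up with what the classifier was trained
on.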
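To answer the diagnostic questions above, I believe these checks on a
fitted vectorizer are enough (`email_text` is just any raw email text,
here a placeholder):

    # Size of the fitted model and sparsity of one transformed document.
    email_text = "money meeting now"           # any raw email text
    print len(vectorizer.vocabulary_)           # number of features
    X_doc = vectorizer.transform([email_text])  # 1 x n_features sparse row
    print X_doc.nnz                             # non-zeros in this document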
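And this is roughly how I am profiling the transform call to get
output like the gist linked above, just plain cProfile sorted by
cumulative time:

    # Profile a single predict-time transform call.
    import cProfile
    cProfile.run('vectorizer.transform([email_text])', sort='cumulative')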
