On 23 March 2012 13:27, Philipp Singer <kill...@gmail.com> wrote:
> The IDF statistics are computed once on the whole training corpus as
> passed to the `fit` method and then reused on each call to the
> `transform` method.
>
> For a train / test split one typically calls fit_transform on the train
> split (to compute the IDF vector on the train split only) and reuses
> those IDF values on the test split by calling transform only:
>
> >>> vec = TfidfVectorizer()
> >>> tfidf_train = vec.fit_transform(documents_train)
> >>> tfidf_test = vec.transform(documents_test)
>
> The TF-IDF feature extraction per se is unsupervised (it does not need
> the labels). You can then train a supervised classifier on the output
> to use the class of the document and pipeline both to get a document
> classifier.
>
> The new documentation is here:
>
> http://scikit-learn.org/dev/modules/feature_extraction.html#text-feature-extraction
>
> Here is a sample pipeline:
>
> http://scikit-learn.org/dev/auto_examples/grid_search_text_feature_extraction.html
>
> Alright, thanks for the heads up. That's exactly the way I am using it.
>
> Okay, so the tfidf values are for the whole corpus.

Well, not exactly: the IDF weights are "trained" on the training slice of
the corpus and can then be reused for the new data from the test corpus.

> Wouldn't it make sense to just see documents belonging to one class as the
> corpus for the calculation?

I don't understand how you would be able to train a classifier on features
whose values depend on the class: on the new test data you won't have the
label info, hence you won't be able to extract the features for prediction.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
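
The fit / transform split discussed in this thread can be illustrated with a
small stdlib-only sketch. `TinyTfidf` is a hypothetical class written for this
example, not scikit-learn's implementation, and the smoothed IDF formula it
uses is one common variant chosen here as an assumption. It shows that the IDF
weights are learned in `fit` without any labels and are reused unchanged in
`transform`, which is why the same features can be extracted from unlabeled
test documents.

```python
import math
from collections import Counter


class TinyTfidf:
    """Hypothetical, simplified TF-IDF vectorizer (not scikit-learn's code)."""

    def fit(self, documents):
        # Document frequency: number of training documents containing each term.
        n_docs = len(documents)
        df = Counter()
        for doc in documents:
            df.update(set(doc.lower().split()))
        # Smoothed IDF (assumed variant), learned from the training documents only.
        self.idf_ = {
            term: math.log((1 + n_docs) / (1 + count)) + 1
            for term, count in df.items()
        }
        return self

    def transform(self, documents):
        # Reuse the stored IDF weights; terms unseen during fit are dropped.
        rows = []
        for doc in documents:
            tf = Counter(doc.lower().split())
            rows.append(
                {term: n * self.idf_[term] for term, n in tf.items() if term in self.idf_}
            )
        return rows


documents_train = ["the cat sat", "the dog barked", "the cat purred"]
documents_test = ["the cat barked", "a bird sang"]

vec = TinyTfidf().fit(documents_train)       # no labels involved
tfidf_train = vec.transform(documents_train)
tfidf_test = vec.transform(documents_test)   # IDF is not recomputed here
```

In this toy data "the" occurs in every training document and therefore gets the
lowest IDF weight, while the second test document shares no term with the
training vocabulary and maps to an empty row.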