The IDF statistics are computed once on the whole training corpus passed to the `fit` method and then reused on each call to the `transform` method. For a train/test split, one typically calls fit_transform on the train split (to compute the IDF vector on the train split only) and reuses those IDF values on the test split by calling transform only:

>>> vec = TfidfVectorizer()
>>> tfidf_train = vec.fit_transform(documents_train)
>>> tfidf_test = vec.transform(documents_test)

The TF-IDF feature extraction per se is unsupervised (it does not need the labels). You can then train a supervised classifier on its output to predict the class of the document, and pipeline both to get a document classifier.

The new documentation is here:
http://scikit-learn.org/dev/modules/feature_extraction.html#text-feature-extraction

Here is a sample pipeline:
http://scikit-learn.org/dev/auto_examples/grid_search_text_feature_extraction.html
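To make the pattern above concrete, here is a minimal runnable sketch. The documents and labels are toy placeholders (not from the original discussion), and LogisticRegression is an arbitrary choice of classifier to illustrate the pipelining step; any scikit-learn classifier would do.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus and labels, for illustration only.
documents_train = ["the cat sat on the mat",
                   "the dog ran in the park",
                   "cats and dogs are pets"]
labels_train = [0, 1, 1]
documents_test = ["the cat ran"]

# Fit the IDF statistics on the training split only...
vec = TfidfVectorizer()
tfidf_train = vec.fit_transform(documents_train)

# ...and reuse those same IDF values on the test split.
tfidf_test = vec.transform(documents_test)

# Pipelining TF-IDF extraction with a supervised classifier
# gives a document classifier, as described above.
clf = Pipeline([("tfidf", TfidfVectorizer()),
                ("logreg", LogisticRegression())])
clf.fit(documents_train, labels_train)
predictions = clf.predict(documents_test)
```

Note that the test matrix has the same number of columns (vocabulary size) as the training matrix, since the vocabulary and IDF vector come from the train split alone.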
Alright, thanks for the heads up. That's exactly the way I am using it. Okay, so the TF-IDF values are computed for the whole corpus. Wouldn't it make sense to treat only the documents belonging to one class as the corpus for the calculation?

Regards,
Philipp
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general