On 23 March 2012 at 12:06, Philipp Singer <kill...@gmail.com> wrote:
> Hey!
>
> I am currently using sklearn.feature_extraction.text.Vectorizer for feature
> extraction of text documents I have.
>
> I am now curious and don't quite understand how the TF-IDF calculation is
> done. Is it done separately for each document or based on all documents? It
> can't be done for each class of documents, because information about the
> labels is not available.

The IDF statistic is computed once on the whole training corpus passed to the `fit` method and is then reused on each call to the `transform` method. For a train / test split, one typically calls `fit_transform` on the train split (to compute the IDF vector on the train split only) and reuses those IDF values on the test split by calling `transform` only:

  >>> from sklearn.feature_extraction.text import TfidfVectorizer
  >>> vec = TfidfVectorizer()
  >>> tfidf_train = vec.fit_transform(documents_train)
  >>> tfidf_test = vec.transform(documents_test)

The TF-IDF feature extraction per se is unsupervised (it does not need the labels). You can then train a supervised classifier on its output, which is the step where the class of each document is used, and combine both steps in a pipeline to get a document classifier.

The new documentation is here:

  http://scikit-learn.org/dev/modules/feature_extraction.html#text-feature-extraction

Here is a sample pipeline:

  http://scikit-learn.org/dev/auto_examples/grid_search_text_feature_extraction.html

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
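
To make the fit / transform split described above concrete, here is a minimal pure-Python sketch of the idea. It is not scikit-learn's code: the `TinyTfidf` class, the whitespace tokenization, and the toy documents are hypothetical, and the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 is an assumption chosen for illustration. Normalization of the output rows is left out to keep it short.

```python
import math
from collections import Counter


class TinyTfidf:
    """Hypothetical, simplified stand-in for a TF-IDF vectorizer."""

    def fit(self, documents):
        # Document frequency: number of training documents containing each term.
        df = Counter()
        for doc in documents:
            df.update(set(doc.lower().split()))
        n_docs = len(documents)
        # Assumed smoothed IDF: ln((1 + n) / (1 + df)) + 1, computed once here.
        self.idf_ = {
            term: math.log((1 + n_docs) / (1 + count)) + 1.0
            for term, count in df.items()
        }
        return self

    def transform(self, documents):
        # Reuse the stored IDF values; terms never seen during fit are dropped.
        rows = []
        for doc in documents:
            tf = Counter(doc.lower().split())
            rows.append(
                {t: c * self.idf_[t] for t, c in tf.items() if t in self.idf_}
            )
        return rows

    def fit_transform(self, documents):
        return self.fit(documents).transform(documents)


# Toy data, invented for this example.
documents_train = ["the cat sat", "the dog barked", "the cat purred"]
documents_test = ["the cat barked loudly"]

vec = TinyTfidf()
tfidf_train = vec.fit_transform(documents_train)  # IDF learned on train only
idf_after_fit = dict(vec.idf_)
tfidf_test = vec.transform(documents_test)        # IDF reused, not recomputed
```

The labels never appear anywhere: `fit` only counts in how many training documents each term occurs, and `transform` leaves `vec.idf_` untouched, so a word such as "loudly" that occurs only in the test split gets no weight at all.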