On 23 March 2012 13:27, Philipp Singer <kill...@gmail.com> wrote:
> The IDF statistic is computed once on the whole training corpus as
> passed to the `fit` method and then reused on each call to the
> `transform` method.
> For a train / test split, one typically calls fit_transform on the train
> split (to compute the IDF vector on the train split only) and reuses
> those IDF values on the test split by calling transform only:
> >>> vec = TfidfVectorizer()
> >>> tfidf_train = vec.fit_transform(documents_train)
> >>> tfidf_test = vec.transform(documents_test)
> The TF-IDF feature extraction per se is unsupervised (it does not need
> the labels). You can then train a supervised classifier on its output,
> which does use the class of each document, and pipeline the two to get
> a document classifier.
> The new documentation is here:
> http://scikit-learn.org/dev/modules/feature_extraction.html#text-feature-extraction
> Here is a sample pipeline:
> http://scikit-learn.org/dev/auto_examples/grid_search_text_feature_extraction.html
> Alright, thanks for the heads-up. That's exactly the way I am using it.
> Okay, so the tfidf values are for the whole corpus.

Well, not exactly: the IDF weights are "trained" on the training slice
of the corpus and can then be reused for the new data from the test
split.
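
To make this concrete, here is a minimal sketch. It assumes a recent
scikit-learn in which TfidfVectorizer exposes the fitted weights as the
`idf_` attribute; the toy documents are made up for illustration. It
checks that `transform` leaves the IDF weights learned by
`fit_transform` untouched:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpora, invented for illustration only.
documents_train = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]
documents_test = [
    "the bird sat on the cat",
    "a completely unseen sentence",
]

vec = TfidfVectorizer()

# fit_transform learns the vocabulary and the IDF weights from the train split.
tfidf_train = vec.fit_transform(documents_train)
idf_after_fit = vec.idf_.copy()

# transform only applies the stored vocabulary and IDF weights to new documents.
tfidf_test = vec.transform(documents_test)
idf_after_transform = vec.idf_.copy()

# The IDF vector is unchanged, and both matrices share the same columns.
same_idf = bool((idf_after_fit == idf_after_transform).all())
same_width = tfidf_train.shape[1] == tfidf_test.shape[1]
print(same_idf, same_width)
```

Words that only occur in the test documents are simply ignored, since
they are not part of the vocabulary learned on the train split.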

> Wouldn't it make sense to just see documents belonging to one class as the
> corpus for the calculation?

I don't understand how you would be able to train a classifier on
features whose values depend on the class: on the new test data
you won't have the label info, hence you won't be able to extract the
features for prediction.
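
For comparison, here is a rough sketch of the usual setup, where the
features are computed without labels and only the classifier sees them.
The choice of MultinomialNB, the step names and the toy data are my own
assumptions for illustration, not something prescribed in this thread:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy labelled training data, invented for illustration only.
documents_train = [
    "the striker scored a late goal",
    "the goalkeeper saved the penalty",
    "the parliament voted on the budget",
    "the minister announced a new tax",
]
labels_train = ["sport", "sport", "politics", "politics"]

# New documents: no labels are available at prediction time.
documents_test = [
    "a penalty decided the match",
    "the budget vote was postponed",
]

clf = Pipeline([
    ("tfidf", TfidfVectorizer()),  # unsupervised: ignores the labels
    ("nb", MultinomialNB()),       # supervised: uses the labels
])

# fit computes the IDF weights on the train documents, then trains the classifier.
clf.fit(documents_train, labels_train)

# predict reuses the stored IDF weights; no label is needed to build the features.
predicted = clf.predict(documents_test)
print(list(predicted))
```

Because the vectorizer never looks at the labels, the same feature
extraction can be applied to unlabelled documents at prediction time.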

http://twitter.com/ogrisel - http://github.com/ogrisel

Scikit-learn-general mailing list
