The IDF statistics are computed once on the whole training corpus
passed to the `fit` method and are then reused on each call to the
`transform` method.

For a train / test split, one typically calls `fit_transform` on the
train split (to compute the IDF vector on the train split only) and
reuses those IDF values on the test split by calling `transform` only:

>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> vec = TfidfVectorizer()
>>> tfidf_train = vec.fit_transform(documents_train)
>>> tfidf_test = vec.transform(documents_test)
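To make the split concrete, here is a minimal self-contained sketch. The
toy documents and the helper variable names are made up for illustration
and do not come from any real dataset. It checks that `transform` leaves
the fitted IDF vector untouched and that the test matrix uses the
vocabulary learned on the train split:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents, made up for illustration only.
documents_train = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]
documents_test = ["the bird sat on the dog"]

vec = TfidfVectorizer()
tfidf_train = vec.fit_transform(documents_train)  # learns vocabulary and IDF
idf_after_fit = vec.idf_.copy()

tfidf_test = vec.transform(documents_test)  # reuses the train-split IDF

# transform() does not refit the IDF vector.
idf_unchanged = np.array_equal(idf_after_fit, vec.idf_)

# Both matrices have one column per term of the train vocabulary.
same_width = tfidf_train.shape[1] == tfidf_test.shape[1]

# "bird" never appeared during fit, so it has no column and is ignored.
bird_in_vocab = "bird" in vec.vocabulary_
```

Words that only occur in the test split are dropped silently, which is
usually what you want when evaluating on held-out data.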

The TF-IDF feature extraction per se is unsupervised (it does not need
the labels). You can then train a supervised classifier on its output,
using the class of each document as the target, and chain both steps
in a pipeline to get a document classifier.
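The chaining of the two steps can be sketched roughly as follows. The
toy documents, the labels and the choice of `LogisticRegression` are
assumptions of mine for illustration; any scikit-learn classifier that
accepts sparse input should fit in the same slot:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy labelled corpus, made up for illustration only.
docs_train = [
    "the striker scored a late goal",
    "the goalkeeper saved the penalty",
    "the parliament passed the new law",
    "the minister announced a tax reform",
]
labels_train = ["sport", "sport", "politics", "politics"]

# Step 1 is unsupervised (labels are ignored by the vectorizer),
# step 2 is the supervised classifier trained on the TF-IDF output.
clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("model", LogisticRegression()),  # assumed classifier, swap as needed
])

# fit() calls fit_transform on the vectorizer, then fits the classifier.
clf.fit(docs_train, labels_train)

# predict() only calls transform on the vectorizer, so the IDF values
# computed on the train documents are reused for new documents.
predicted = clf.predict(["the goalkeeper scored a goal"])
```

Because the vectorizer lives inside the pipeline, cross-validation and
grid search refit the IDF values on each training fold only.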

The new documentation is here:

   
http://scikit-learn.org/dev/modules/feature_extraction.html#text-feature-extraction

Here is a sample pipeline:

   
http://scikit-learn.org/dev/auto_examples/grid_search_text_feature_extraction.html
Alright, thanks for the heads-up. That's exactly the way I am using it.

Okay, so the tfidf values are for the whole corpus.

Wouldn't it make sense to treat only the documents belonging to one
class as the corpus for the calculation?

Regards,
Philipp
 

_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
