On 23 March 2012 at 12:06, Philipp Singer <kill...@gmail.com> wrote:
> Hey!
>
> I am currently using sklearn.feature_extraction.text.Vectorizer for feature
> extraction of text documents I have.
>
> I am now curious and don't quite understand how the TFIDF calculation is
> done. Is it done separately for each document or based on all documents? It
> can't be done for each class of documents, because information about the
> labels is not available.

The IDF statistics are computed once on the whole training corpus as
passed to the `fit` method and then reused on each call to the
`transform` method.

For a train / test split, you typically call fit_transform on the train
split (to compute the IDF vector on the train split only) and reuse
those IDF values on the test split by calling transform only:

>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> vec = TfidfVectorizer()
>>> tfidf_train = vec.fit_transform(documents_train)
>>> tfidf_test = vec.transform(documents_test)
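To make the "computed once at fit time, reused at transform time" behaviour concrete, here is a small self-contained sketch. The toy documents are made up for illustration, and it assumes a recent scikit-learn release in which the fitted vectorizer exposes the learned weights as `idf_` and the term index as `vocabulary_`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpora, invented for this example only.
documents_train = ["the cat sat", "the dog barked", "the cat purred"]
documents_test = ["the bird sang", "the cat sat"]

vec = TfidfVectorizer()

# fit_transform learns the vocabulary and the IDF weights from the
# training documents only, then returns their TF-IDF matrix.
tfidf_train = vec.fit_transform(documents_train)
idf_after_fit = vec.idf_.copy()

# transform reuses the stored vocabulary and IDF weights; nothing is
# re-estimated from the test documents.
tfidf_test = vec.transform(documents_test)

idf_unchanged = bool((vec.idf_ == idf_after_fit).all())

# Words that only occur in the test split ("bird", "sang") are not in
# the fitted vocabulary, so both matrices have the same number of columns.
same_width = tfidf_train.shape[1] == tfidf_test.shape[1]
```

Both `idf_unchanged` and `same_width` should come out true, which is the property that keeps information from the test split out of the features.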

The TF-IDF feature extraction per se is unsupervised (it does not need
the labels). You can then train a supervised classifier on its output,
which is where the document classes come in, and chain both steps in a
pipeline to get a document classifier.
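As a rough sketch of that idea, the vectorizer and a classifier can be chained with `sklearn.pipeline.Pipeline`. The documents, the labels, and the choice of `MultinomialNB` below are assumptions made for illustration; any classifier that accepts sparse input would do:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy labelled corpus, invented for this example only.
documents_train = [
    "the cat purred on the sofa",
    "the dog barked at the cat",
    "stocks fell on the market today",
    "the market rallied as stocks rose",
]
labels_train = ["pets", "pets", "finance", "finance"]

# Step 1 is unsupervised (TF-IDF), step 2 is supervised (the classifier).
text_clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB()),
])

# fit calls fit_transform on the vectorizer, then fits the classifier
# on the resulting TF-IDF matrix together with the labels.
text_clf.fit(documents_train, labels_train)

# predict only calls transform on the vectorizer, so the IDF weights
# learned from the training documents are reused for the new documents.
documents_test = ["the dog chased the cat", "stocks and the market"]
predicted = text_clf.predict(documents_test)
```

The labels are only ever seen by the last step of the pipeline; the TF-IDF step ignores them.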

The new documentation is here:

  
http://scikit-learn.org/dev/modules/feature_extraction.html#text-feature-extraction

Here is a sample pipeline:

  
http://scikit-learn.org/dev/auto_examples/grid_search_text_feature_extraction.html

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
