Re: [Scikit-learn-general] Text Documents - Vectorizer

Philipp Singer Fri, 30 Mar 2012 05:51:23 -0700

On 23.03.2012 13:58, Olivier Grisel wrote:
> On 23 March 2012 13:27, Philipp Singer <kill...@gmail.com> wrote:
>> The IDF statistics are computed once on the whole training corpus as
>> passed to the `fit` method and then reused on each call to the
>> `transform` method.
>>
>> For a train / test split one typically calls fit_transform on the train
>> split (to compute the IDF vector on the train split only) and reuse
>> those IDF values on the test split by calling transform only:
>>
>> >>> vec = TfidfVectorizer()
>> >>> tfidf_train = vec.fit_transform(documents_train)
>> >>> tfidf_test = vec.transform(documents_test)
>>
>> The TF-IDF feature extraction per se is unsupervised (it does not need
>> the labels). You can then train a supervised classifier on the output
>> to use the class of the document and pipeline both to get a document
>> classifier.
>>
>> The new documentation is here:
>>
>>
>> http://scikit-learn.org/dev/modules/feature_extraction.html#text-feature-extraction
>>
>> Here is a sample pipeline:
>>
>>
>> http://scikit-learn.org/dev/auto_examples/grid_search_text_feature_extraction.html
>>
>> Alright, thanks for the heads up. That's exactly the way I am using it.
>>
>> Okay, so the tfidf values are for the whole corpus.
> Well not exactly: the IDF weights are "trained" on the training slice
> of the corpus and can then be reused for the new data from the test
> corpus.
>
>> Wouldn't it make sense to just see documents belonging to one class as the
>> corpus for the calculation?
> I don't understand how you would be able to train a classifier on
> features with values that depend on the class: on the new test data
> you won't have the label info hence won't be able to extract the
> features for prediction.
>


Thanks for the heads up! It's much clearer now; I hadn't thought about
that simple aspect.
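
To spell that aspect out, here is a minimal sketch, assuming a recent
scikit-learn where the fitted vectorizer exposes `idf_`. The toy
documents are made up for illustration. It shows that `transform`
takes only raw text (no labels) and leaves the IDF weights learned in
`fit_transform` untouched:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpora, only for illustration.
documents_train = ["the cat sat", "the dog barked", "a cat and a dog"]
documents_test = ["the bird sang"]

vec = TfidfVectorizer()

# IDF weights are learned here, from the training split only.
tfidf_train = vec.fit_transform(documents_train)
idf_after_fit = vec.idf_.copy()

# transform() needs no labels and reuses the stored IDF weights.
tfidf_test = vec.transform(documents_test)

# Same feature space for both splits, and unchanged IDF weights.
same_width = tfidf_train.shape[1] == tfidf_test.shape[1]
idf_unchanged = (vec.idf_ == idf_after_fit).all()
```

Words that appear only in the test documents are simply ignored,
since they are not part of the vocabulary learned during fitting.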

I have another question on this, because some of my coworkers brought
up an idea and I can't argue about it the way I would like.

So let's assume you have 10 documents in the training set and 10
documents in the test set.

My coworker now suggests that, instead of treating each document as its
own training example, we concatenate all documents of the same class
and use the merged document as a single training example.

So, for example, with 3 classes you would end up with three training
samples, such as:

sample1 = doc 1 + doc 2 + doc 3 + doc 4
sample2 = doc 5 + doc 6 + doc 7
sample3 = doc 8 + doc 9 + doc 10

For the test set you still classify the 10 documents independently.
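
As a rough sketch of what the proposal would look like in code (the
toy documents, the labels and the choice of `MultinomialNB` are my own
assumptions for illustration, not something my coworker specified):

```python
from collections import defaultdict

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data: 10 training documents in 3 classes.
train_docs = [
    "goal match striker", "football league goal", "match referee penalty",
    "striker scores goal", "election vote parliament", "minister vote law",
    "parliament passes law", "cpu memory kernel", "kernel driver update",
    "memory cache cpu",
]
train_labels = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]

# Concatenate all documents of a class into one merged document.
grouped = defaultdict(list)
for doc, label in zip(train_docs, train_labels):
    grouped[label].append(doc)
classes = sorted(grouped)
merged_docs = [" ".join(grouped[c]) for c in classes]

# One training sample per class: the matrix has 3 rows.
vec = TfidfVectorizer()
X_merged = vec.fit_transform(merged_docs)

# Classifier choice is an assumption; any estimator could be used here.
clf = MultinomialNB().fit(X_merged, classes)

# Test documents are still vectorized and classified one by one.
test_docs = ["the striker missed the penalty", "new kernel uses less memory"]
X_test = vec.transform(test_docs)
predictions = clf.predict(X_test)
```

Note that with this setup the IDF weights are computed over only three
"documents", so a term that occurs in every class gets the lowest
possible weight no matter how many original documents contained it.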

I hope I have made the problem reasonably clear.

Hope you can help me.

Thanks and many regards,
Philipp
