Re: [Scikit-learn-general] Text Documents - Vectorizer

Olivier Grisel Fri, 30 Mar 2012 06:38:43 -0700

On 30 March 2012 14:50, Philipp Singer <kill...@gmail.com> wrote:
>
> I just have another question regarding this, because some of my coworkers
> brought this idea up and I can't argue about it the way I'd like.
>
> So let's assume you have 10 documents in the training set and 10
> documents in the test set.
>
> My coworker now suggests that, instead of taking each document as its own
> training example, we group together all documents of the same class and
> use each merged document as a training example.
>
> So for example, if you have 3 classes, you end up with three training
> documents, where for example
>
> sample1 = doc 1 + doc 2 + doc 3 + doc 4
> sample2 = doc 5 + doc 6 + doc 7
> sample3 = doc 8 + doc 9 + doc 10
>
> For the test set you still classify the 10 documents independently.
>
> I hope I have made the problem reasonably clear.


This is essentially Rocchio classification [1], where each class is
represented by a single prototype vector. It is implemented by the
NearestCentroid model in this pull request:

  https://github.com/scikit-learn/scikit-learn/pull/690
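
To make the idea concrete, here is a minimal pure-Python sketch of a
nearest-centroid text classifier. It is not the code from the pull
request: the whitespace tokenizer, the raw term counts, the cosine
similarity, the helper names (term_counts, fit_centroids, cosine,
predict) and the toy corpus are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def term_counts(text):
    """Bag-of-words term counts for one document (whitespace tokenization)."""
    return Counter(text.lower().split())


def fit_centroids(docs, labels):
    """Return one averaged term-count vector (centroid) per class."""
    sums = defaultdict(Counter)
    sizes = Counter(labels)
    for doc, label in zip(docs, labels):
        sums[label].update(term_counts(doc))
    return {
        label: {term: count / sizes[label] for term, count in counts.items()}
        for label, counts in sums.items()
    }


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict(centroids, doc):
    """Assign the document to the class whose centroid is most similar."""
    vec = term_counts(doc)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Hypothetical toy corpus, for illustration only.
train_docs = [
    "the striker scored a goal",
    "the keeper saved the penalty",
    "the senate passed the budget bill",
    "the minister proposed a new tax bill",
]
train_labels = ["sport", "sport", "politics", "politics"]

centroids = fit_centroids(train_docs, train_labels)
prediction = predict(centroids, "a late goal from the striker")
print(prediction)
```

With raw term counts and cosine similarity, merging all documents of a
class into one sample gives the same prototype direction as averaging
them, since the sum is just the mean scaled by the class size. That
equivalence no longer holds exactly once per-document length
normalization or IDF weighting is applied.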

BTW: doing text classification with a training dataset of only 10
samples is bound to produce overfitted models with no practical value.
IMHO there is no point in doing text classification with fewer than a
couple of hundred samples per class.

[1] http://nlp.stanford.edu/IR-book/html/htmledition/rocchio-classification-1.html

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
