On 30 March 2012 at 14:50, Philipp Singer <kill...@gmail.com> wrote:
>
> I just have another question regarding this, because some of my coworkers
> brought this idea up and I can't argue about it the way I would like.
>
> So let's assume you have 10 documents in the training set and 10
> documents in the test set.
>
> My coworker now suggests that, instead of taking each document as its own
> training example, you group together all documents of the same class and
> use this new document as a training example.
>
> So for example, if you have 3 classes, take three training documents
> where for example
>
> sample1 = doc 1 + doc 2 + doc 3 + doc 4
> sample2 = doc 5 + doc 6 + doc 7
> sample3 = doc 8 + doc 9 + doc 10
>
> For the test set you still classify the 10 documents independently.
>
> I hope I have made this problem somehow clear.

This is called Rocchio classification [1] and is implemented by the
NearestCentroid model in this pull request:

https://github.com/scikit-learn/scikit-learn/pull/690

BTW: doing text classification with a training dataset of only 10 samples is
bound to produce overfitted models with no practical value. IMHO there is no
point in doing text classification with fewer than a couple of hundred samples
per class.

[1] http://nlp.stanford.edu/IR-book/html/htmledition/rocchio-classification-1.html

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
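
To make the grouping idea from the question concrete, here is a minimal, dependency-free sketch. It is not the NearestCentroid code from the pull request: the helper names (bag_of_words, cosine, fit_class_vectors, predict) and the toy documents are hypothetical, and it uses raw word counts with cosine similarity, whereas the description in [1] works with weighted, length-normalized vectors. Concatenating all documents of a class and counting words gives the sum of the per-document count vectors, which points in the same direction as their centroid, so under cosine similarity the two rank classes identically.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Turn a document into a word-count vector (whitespace tokenization)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def fit_class_vectors(docs, labels):
    """Sum the count vectors of all documents sharing a label.

    This is the same as concatenating the documents of each class into
    one big document and counting its words.
    """
    class_vectors = {}
    for doc, label in zip(docs, labels):
        class_vectors.setdefault(label, Counter()).update(bag_of_words(doc))
    return class_vectors


def predict(class_vectors, doc):
    """Assign the label whose class vector is most similar to the document."""
    vec = bag_of_words(doc)
    return max(class_vectors, key=lambda label: cosine(class_vectors[label], vec))


# Toy data, made up for illustration only; far too small for real use.
train_docs = [
    "the match ended in a draw",
    "the team won the cup",
    "the market fell sharply",
    "shares and bonds rallied",
]
train_labels = ["sport", "sport", "finance", "finance"]

class_vectors = fit_class_vectors(train_docs, train_labels)
prediction_sport = predict(class_vectors, "the team lost the match")
prediction_finance = predict(class_vectors, "bonds fell")
```

Test documents are still classified one at a time, as in the question; only the training side is collapsed to one vector per class.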