I just want to be sure I understand something clearly. I am training two
SVC classifiers, both with probability=True and class_weight="auto". The first
(svm1) is trained on data that is 50% positive and 50% negative. The second
(svm2) is trained on data that is 10% positive and 90% negative (negative
examples are much easier to get than positive ones).
If I apply a test set that is 10% positive and 90% negative to svm2, I get
very good results. However, if I apply the same test set to svm1 (trained on
balanced data), I get good results on the positive cases but rather poor
results on the negative cases. If I apply a balanced test set to svm1, I get
very good results. In each case I'm looking at the output of predict(), though
I am also calling predict_proba().
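To make "good on the positive cases, poor on the negative cases" concrete, here is a minimal sketch of scoring each class separately. The helper name per_class_recall and the labels are made up for illustration; they are not my real data or results.

```python
def per_class_recall(y_true, y_pred):
    """Fraction of examples of each true class that were predicted correctly."""
    totals, hits = {}, {}
    for t, p in zip(y_true, y_pred):
        totals[t] = totals.get(t, 0) + 1
        if t == p:
            hits[t] = hits.get(t, 0) + 1
    return {c: hits.get(c, 0) / totals[c] for c in totals}

# Hypothetical test set: 2 positives (1) and 8 negatives (0).
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
# Hypothetical predictions: every positive found, three negatives misclassified.
y_pred = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

recall = per_class_recall(y_true, y_pred)
print(recall)
```

With many more negatives than positives in the test set, even a modest error rate on the negative class produces a large absolute number of false positives, which is the pattern described above.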
So, why are the results different? Is it because class_weight="auto" places
the decision boundary differently for balanced and unbalanced training data?
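For reference, a rough sketch of what inverse-frequency class weighting does to the two training sets. It assumes the normalization n_samples / (n_classes * count), which is the form documented for the later class_weight="balanced" option; the exact scaling used by "auto" may differ, but the ratio between the class weights should be the same. The function name and the sample counts (1000 examples per set) are hypothetical.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}

# Hypothetical training sets of 1000 examples each (1 = positive, 0 = negative).
balanced_train = [1] * 500 + [0] * 500   # like the svm1 training data
skewed_train = [1] * 100 + [0] * 900     # like the svm2 training data

w_balanced = inverse_frequency_weights(balanced_train)
w_skewed = inverse_frequency_weights(skewed_train)
print(w_balanced)
print(w_skewed)
```

Under this assumption the weighting changes nothing for the balanced set, while for the 10%/90% set an error on a positive example costs nine times as much as an error on a negative one.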
My intention is to build a classifier and then apply it to new feature vectors
one at a time and get a meaningful decision or probability output as to class
assignment. In practical use, there will be many, many more negative cases
than positive cases.
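One way to think about the mismatch between training and deployment prevalence is a prior correction via Bayes' rule. The sketch below rescales the odds of a probability estimated under one class prior to a different prior. It assumes the predict_proba() output behaves like a calibrated posterior under the training prior, which may not hold for an SVC; the function name and the numbers are hypothetical.

```python
def adjust_for_prior(p, train_prior, deploy_prior):
    """Rescale a positive-class probability from one class prior to another."""
    if p <= 0.0 or p >= 1.0:
        return p  # a certain prediction is unaffected by the prior
    odds = p / (1.0 - p)
    # Ratio of the prior odds in deployment to the prior odds in training.
    prior_ratio = (deploy_prior / (1.0 - deploy_prior)) / (
        train_prior / (1.0 - train_prior)
    )
    new_odds = odds * prior_ratio
    return new_odds / (1.0 + new_odds)

# Trained with 50% positives, used where about 10% of cases are positive.
for p in (0.5, 0.9):
    print(p, adjust_for_prior(p, train_prior=0.5, deploy_prior=0.1))
```

In this sketch a score of 0.5 from a classifier trained on balanced data corresponds to 0.1 once positives make up only 10% of cases, and a score of 0.9 corresponds to 0.5.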
Should I train on a data set that has very few positive cases and many more
negative or should I keep it balanced?
Naively, I would think to train on a balanced data set because I would then be
making the "most use" of the positive cases I have. However, I could also see
the case for training on a data set where |negative| >> |positive|, which more
closely reflects how the classifier would be used.
Any help appreciated!
Ron
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general