I want to make sure I understand something correctly.  I am training two SVC
classifiers, both with probability=True and class_weight="auto".  The first
(svm1) is trained on balanced data: 50% positive and 50% negative examples.
The second (svm2) is trained on 10% positive and 90% negative examples, since
negative examples are much easier to get than positive ones.

If I apply a test set that is 10% positive and 90% negative to svm2, I get
very good results.  However, if I apply the same test set to svm1 (trained on
balanced data), I get good results on the positive cases but rather poor
results on the negative cases.  If I apply a balanced test set to svm1, I get
very good results.  In each case I am looking at the output of predict(),
though I am also calling predict_proba().
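
For concreteness, the comparison described above could be sketched roughly as
follows.  This is not the actual code or data: make_classification stands in
for the real features, the subsample helper and all sample sizes are
hypothetical, and class_weight="balanced" is used in place of "auto", which
later scikit-learn releases deprecated and removed.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.svm import SVC


def subsample(X, y, n_pos, n_neg, rng):
    """Hypothetical helper: draw n_pos positive and n_neg negative rows."""
    pos = rng.choice(np.flatnonzero(y == 1), size=n_pos, replace=False)
    neg = rng.choice(np.flatnonzero(y == 0), size=n_neg, replace=False)
    idx = np.concatenate([pos, neg])
    return X[idx], y[idx]


rng = np.random.default_rng(0)

# Synthetic stand-in for the real data (assumption, not the original set).
X, y = make_classification(n_samples=6000, n_features=10, n_informative=5,
                           random_state=0)
X_pool, y_pool = X[:3000], y[:3000]   # pool for drawing training sets
X_hold, y_hold = X[3000:], y[3000:]   # disjoint pool for the test set

X_bal, y_bal = subsample(X_pool, y_pool, 200, 200, rng)    # 50% / 50%
X_imb, y_imb = subsample(X_pool, y_pool, 100, 900, rng)    # 10% / 90%
X_test, y_test = subsample(X_hold, y_hold, 100, 900, rng)  # 10% / 90%

# "balanced" replaces the older "auto" setting used in the post.
svm1 = SVC(probability=True, class_weight="balanced", random_state=0)
svm1.fit(X_bal, y_bal)
svm2 = SVC(probability=True, class_weight="balanced", random_state=0)
svm2.fit(X_imb, y_imb)

# Per-class recall on the same imbalanced test set, from predict().
results = {}
for name, clf in [("svm1", svm1), ("svm2", svm2)]:
    pred = clf.predict(X_test)
    results[name] = {
        "recall_pos": recall_score(y_test, pred, pos_label=1),
        "recall_neg": recall_score(y_test, pred, pos_label=0),
    }
    print(name, results[name])
```

Reporting recall separately for each class makes it easier to see whether
svm1 and svm2 differ mainly on the negative cases, as observed.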

So, why are the results different?  Is it because class_weight="auto" places
the decision boundary in a different position for the balanced training set
than for the unbalanced one?

My intention is to build a classifier, apply it to new feature vectors one at
a time, and get a meaningful decision or class-probability output for each.
In practical use, there will be far more negative cases than positive cases.
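
As a minimal sketch of that intended usage, assuming a fitted SVC and a single
new feature vector (the synthetic data, the 400/100 split and the variable
names here are illustrative assumptions, not the real setup):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic, mostly-negative data standing in for the real set (assumption).
X, y = make_classification(n_samples=500, n_features=10, weights=[0.9, 0.1],
                           random_state=0)
clf = SVC(probability=True, class_weight="balanced", random_state=0)
clf.fit(X[:400], y[:400])

x_new = X[400]                 # one new feature vector, shape (n_features,)
x_row = x_new.reshape(1, -1)   # estimators expect a 2-D array of samples

label = clf.predict(x_row)[0]              # hard class decision
proba = clf.predict_proba(x_row)[0]        # one probability per class
p_pos = proba[list(clf.classes_).index(1)] # probability of the positive class
score = clf.decision_function(x_row)[0]    # signed distance-like score

print(label, p_pos, score)
```

The columns of predict_proba() follow the order of clf.classes_, which is why
the positive-class column is looked up rather than assumed.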

Should I train on a data set that has very few positive cases and many more
negative ones, or should I keep it balanced?

Naively, I would train on a balanced data set, because I would then be making
the "most use" of the positive cases I have.  However, I could also see
training on a data set where |negative| >> |positive|, which more closely
reflects how the classifier would be used.

Any help appreciated!

Ron
                                          
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
