Hi Ron,

The reason .predict() and .predict_proba() don't agree is the method (Platt
scaling) by which the probability values are generated. You can have a look
at my answer here:
http://stackoverflow.com/questions/17017882/scikit-learn-predict-proba-gives-wrong-answers/17142391#17142391
If you don't need probability values, you can use .decision_function() to
get an idea of how likely each class is; it returns signed floating point
values that you can use to rank the classes.
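For illustration, here is a minimal sketch (on a toy make_classification
dataset, not your data) of how the three outputs can differ:

    # With probability=True, SVC fits a separate Platt-scaling model via
    # internal cross-validation, so argmax of predict_proba() can disagree
    # with predict(), especially on small datasets. decision_function()
    # returns the raw signed margins, which are consistent with predict().
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=100, n_features=5, random_state=0)
    clf = SVC(kernel="rbf", probability=True, random_state=0).fit(X, y)

    pred = clf.predict(X)                             # labels from the decision function
    proba_pred = clf.predict_proba(X).argmax(axis=1)  # labels from Platt scaling
    scores = clf.decision_function(X)                 # signed margins, usable for ranking

    print("disagreements:", np.sum(pred != proba_pred))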



On Mon, Sep 16, 2013 at 11:35 PM, Ron Kneusel <oneelkr...@hotmail.com> wrote:

> I just want to be sure I'm understanding something clearly.  I am training
> two SVC classifiers with probability=True and class_weight="auto".  For the
> first, my training data is 50% positive and 50% negative examples (svm1).
>  For the second, it is 10% positive and 90% negative (negative examples are
> much easier to get than positive).  Call the second one svm2.
>
> If I then apply a test data set which is 10% positive and 90% negative to
> svm2 I get very good results.  However, if I apply the same test data set
> to svm1 (trained on balanced data) I get good results on the positive cases
> but rather poor on the negative cases.  If I apply a balanced test set to
> svm1 I get very good results.  In each case I'm looking at the output of
> predict() though I am also calling predict_proba().
>
> So, why are the results different?  Is it due to class_weight="auto"
> making the balanced decision plane appear in a different position than the
> unbalanced case?
>
> My intention is to build a classifier and then apply it to new feature
> vectors one at a time and get a meaningful decision or probability output
> as to class assignment.  In practical use, there will be many, many more
> negative cases than positive cases.
>
> Should I train on a data set that has very few positive cases and many
> more negative or should I keep it balanced?
>
> Naively, I would think to train on a balanced data set because I would
> then be making the "most use" of the positive cases I have.  However, I
> could see training on a data set where |negative| >> |positive| which more
> closely reflects how the classifier would be used.
>
> Any help appreciated!
>
> Ron
>


-- 
Bilal Dadanlar
cimri.com | Software Engineer