Hi Sean,
I'd be happy to review / propose enhancements to your documentation if you
decide to write up something.
I'm sure I'm not the only one out there ^_^
My experience with unbalanced settings comes from spam modeling, which is a
cousin of your case.
Eustache
2013/8/29 Andreas Mueller <amuel...@ais.uni-bonn.de>
> Hi Sean.
> We were talking about adding topical how-tos to the website, and that
> indeed looks like a good candidate, as the question pops up a lot.
> If you want to write one, I'm sure you'll get valuable feedback :)
>
> Cheers,
> Andy
>
>
> On 08/29/2013 10:20 AM, Sean Violante wrote:
>
> I was wondering if some documentation could be prepared/is available for
> the workflow for handling unbalanced data.
>
> I am looking at website click-through data, but I am sure similar
> issues occur in other cases.
>
> The following seems to be one approach and the steps it requires. I would
> value corrections and comments, but it seems like it would be useful to
> develop a tutorial to clarify a specific use case, since it does seem to
> involve changes to a number of different steps.
>
> a) undersample the most frequent class [assuming you have plenty of data]
>
> b) We are still interested in the "true" probabilities, so use logistic
> regression and reweight classes to adjust for undersampling. [rankSVM? Do
> any such models exist in sklearn?]
>
> c) use AUC [or other metric for unbalanced classes]
>
> d) on validation/test sets either do not undersample, or undersample but
> then reweight for the metric calculation.
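Steps a) to d) above could be sketched roughly as follows. This is only a minimal illustration, assuming a recent scikit-learn (with `sklearn.model_selection`), a synthetic dataset from `make_classification`, and a hypothetical `keep_fraction` for the share of majority-class rows that is kept; none of these names come from the original message.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic unbalanced data (assumption: roughly 3% positives).
X, y = make_classification(n_samples=20000, n_features=10,
                           weights=[0.97, 0.03], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# a) Undersample the frequent class in the training set only.
keep_fraction = 0.1  # hypothetical: keep 10% of the negatives
rng = np.random.RandomState(0)
neg = np.flatnonzero(y_train == 0)
pos = np.flatnonzero(y_train == 1)
kept_neg = rng.choice(neg, size=int(keep_fraction * neg.size), replace=False)
idx = np.concatenate([pos, kept_neg])
X_sub, y_sub = X_train[idx], y_train[idx]

# b) Reweight the kept negatives by 1 / keep_fraction so the fitted
#    probabilities refer to the original class balance.
clf = LogisticRegression(class_weight={0: 1.0 / keep_fraction, 1: 1.0},
                         max_iter=1000)
clf.fit(X_sub, y_sub)

# c) + d) Score with AUC on the test set, which was NOT undersampled.
proba = clf.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, proba)
mean_pred = proba.mean()    # should be close to the true positive rate
base_rate = y_test.mean()
print("AUC: %.3f, mean predicted: %.4f, base rate: %.4f"
      % (auc, mean_pred, base_rate))
```

An alternative to reweighting is to fit on the undersampled data without weights and then multiply the predicted odds p / (1 - p) by `keep_fraction` before converting back to a probability; both routes aim at the same correction.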
>
> This "workflow" maintains the correct probabilities. Another approach,
> e.g. in order to use a standard SVM, is
>
> a) fit model on EITHER rebalanced data [if "too much frequent class"] OR
> reweight classes
>
> b) use AUC as the cross-validation metric [in v0.14]
>
> c) use decision_function output... [can predict_proba output in SVM be
> used, including a reweighting scheme?]
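For the SVM variant, a rough sketch could look like the one below. It assumes a recent scikit-learn in which `class_weight="balanced"` and `scoring="roc_auc"` are available, and it uses a synthetic dataset; the data and variable names are illustrative, not taken from the thread.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

# Synthetic unbalanced data (assumption: roughly 10% positives).
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# a) Reweight classes instead of resampling the data.
svm = SVC(class_weight="balanced")

# b) AUC as the cross-validation metric.
cv_auc = cross_val_score(svm, X_train, y_train, cv=5, scoring="roc_auc")
print("CV AUC: %.3f +/- %.3f" % (cv_auc.mean(), cv_auc.std()))

# c) Rank test samples by the signed distance to the separating surface.
svm.fit(X_train, y_train)
scores = svm.decision_function(X_test)
test_auc = roc_auc_score(y_test, scores)
print("Test AUC: %.3f" % test_auc)
```

AUC only depends on the ordering of the scores, so the raw `decision_function` values are enough for ranking; they are not probabilities.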
>
> Lastly, a tutorial on SVM predict_proba or other ways of generating
> probabilities from distance functions would be useful in itself!
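One way to turn decision values into probabilities is sigmoid (Platt) calibration. The sketch below assumes a recent scikit-learn that ships `CalibratedClassifierCV`, and wraps a `LinearSVC` on synthetic data; treat it as an illustration of the idea rather than a recommended recipe.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic unbalanced data (assumption: roughly 10% positives).
X, y = make_classification(n_samples=3000, weights=[0.9, 0.1],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# Fit a sigmoid on held-out decision_function values within each CV fold,
# then average the calibrated models at prediction time.
calibrated = CalibratedClassifierCV(LinearSVC(max_iter=10000),
                                    method="sigmoid", cv=5)
calibrated.fit(X_train, y_train)

proba = calibrated.predict_proba(X_test)[:, 1]
print("mean predicted: %.4f, base rate: %.4f" % (proba.mean(), y_test.mean()))
```

The calibrated probabilities reflect the class balance of the data used for calibration, so if that data was rebalanced, a correction for the resampling is still needed afterwards.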
>
> Sean
>
>
> ------------------------------------------------------------------------------
> Learn the latest--Visual Studio 2012, SharePoint 2013, SQL 2012, more!
> Discover the easy way to master current and previous Microsoft technologies
> and advance your career. Get an incredible 1,500+ hours of step-by-step
> tutorial videos with LearnDevNow. Subscribe today and save!
> http://pubads.g.doubleclick.net/gampad/clk?id=58040911&iu=/4140/ostg.clktrk
>
>
>
> _______________________________________________
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general