I was wondering whether some documentation could be prepared, or already
exists, describing a workflow for handling unbalanced data.
I am looking at web-site click-through data, but I am sure similar issues
come up in many other cases.
The steps below seem like one reasonable approach. I would value corrections
and comments, but it also seems worth developing a tutorial around a specific
use case, since handling the imbalance appears to touch a number of different
steps.
a) Undersample the most frequent class [assuming you have plenty of data].
b) We are still interested in the "true" probabilities, so use logistic
regression and reweight the classes to adjust for the undersampling.
[RankSVM? Do any such models exist in sklearn?]
c) Use AUC [or another metric suited to unbalanced classes].
d) On the validation/test sets, either do not undersample, or undersample
and then reweight for the metric calculation.
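Something like the following rough sketch is what I have in mind for (a)-(d).
It assumes a recent scikit-learn, the data is just synthetic make_classification
output standing in for click-through logs, and weighting the kept negatives by
the inverse sampling rate is only one possible prior correction (an intercept
adjustment on the logit would be another):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Imbalanced data: ~2% positives stands in for click-through events.
X, y = make_classification(n_samples=100_000, n_features=20,
                           weights=[0.98, 0.02], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=0)

# (a) Undersample the majority class, on the training set only.
rng = np.random.RandomState(0)
pos = np.where(y_train == 1)[0]
neg = np.where(y_train == 0)[0]
neg_kept = rng.choice(neg, size=len(pos), replace=False)
idx = np.concatenate([pos, neg_kept])
X_sub, y_sub = X_train[idx], y_train[idx]

# (b) Reweight so the fit still reflects the original class balance:
# each kept negative stands in for len(neg)/len(neg_kept) original negatives.
w = {0: len(neg) / len(neg_kept), 1: 1.0}
clf = LogisticRegression(class_weight=w, max_iter=1000)
clf.fit(X_sub, y_sub)

# (c)+(d) Score with AUC on the untouched (not undersampled) test set.
proba = clf.predict_proba(X_test)[:, 1]
print("test AUC:", roc_auc_score(y_test, proba))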
This "workflow" maintains the correct probabilities... another approach eg
in order to use standard SVM is
a) fit model on EITHER rebalanced data [if "too much frequent class"] OR
reweight classes
b) use AUC/ as cross-validation metric [in v14]
c) use distance_function output... [ can predict_proba output in SVM be
used including a reweighting scheme?]
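For the SVM route, a minimal sketch on the same kind of synthetic data; here
class_weight="balanced" stands in for explicit reweighting, and
scoring="roc_auc" ranks by decision_function when the estimator has no
predict_proba:

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=20_000, n_features=20,
                           weights=[0.98, 0.02], random_state=0)

# (a) reweight classes instead of resampling; (b) AUC as the CV metric.
svm = LinearSVC(class_weight="balanced", max_iter=5000)
print("CV AUC:", cross_val_score(svm, X, y, cv=5, scoring="roc_auc").mean())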
Lastly, a tutorial on SVM predict_proba, or on other ways of generating
probabilities from decision-function outputs, would be useful in itself!
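On that last point, one hedged sketch of what I mean, using
CalibratedClassifierCV (which appeared in a release later than 0.14;
SVC(probability=True) does Platt scaling internally as well). method="sigmoid"
fits a Platt-style map from decision_function scores to probabilities on
held-out folds:

from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=20_000, n_features=20,
                           weights=[0.98, 0.02], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=0)

# Wrap the SVM so a sigmoid map from decision_function scores to
# probabilities is fit on held-out folds.
calibrated = CalibratedClassifierCV(
    LinearSVC(class_weight="balanced", max_iter=5000),
    method="sigmoid", cv=3)
calibrated.fit(X_train, y_train)
p = calibrated.predict_proba(X_test)[:, 1]
print("first few calibrated probabilities:", p[:5])

Note that if the training data was undersampled, these probabilities would
still reflect the undersampled prior unless corrected as in the first sketch.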
Sean