I'm dealing with a 50-class classification problem with extremely
unbalanced classes. The smallest class has about 1000 samples and the
largest has 500,000. The random forest I've trained is being heavily
skewed towards the big classes.
Is there currently a good way to deal with this kind of problem in
sklearn? Or is there room to implement some kind of stratified bootstrap
strategy or a weighting strategy (as described in
http://statistics.berkeley.edu/sites/default/files/tech-reports/666.pdf,
for example)?
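
To make the two ideas concrete, here is a minimal sketch of what I have in
mind, written in plain Python rather than against any sklearn API. The
function names and the toy labels are made up for illustration, and the
weight formula (n_samples / (n_classes * class_count)) is just one common
inverse-frequency heuristic, not necessarily the scheme used in the linked
report.

```python
import random
from collections import Counter, defaultdict


def inverse_frequency_weights(labels):
    """Map each class to n_samples / (n_classes * class_count).

    Rare classes get weights above 1, frequent classes below 1.
    (Hypothetical helper, for illustration only.)
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


def stratified_bootstrap_indices(labels, n_per_class, rng):
    """Draw n_per_class indices with replacement from every class.

    The resulting bootstrap sample is balanced regardless of how
    skewed the original labels are. (Hypothetical helper.)
    """
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    indices = []
    for y in sorted(by_class):
        indices.extend(rng.choices(by_class[y], k=n_per_class))
    return indices


# Toy data (made up): class 0 is four times larger than class 1.
labels = [0] * 8 + [1] * 2

# Weighting strategy: one weight per class, expanded to one per sample.
weights = inverse_frequency_weights(labels)
sample_weights = [weights[y] for y in labels]

# Stratified bootstrap: the same number of draws from each class.
rng = random.Random(0)
boot = stratified_bootstrap_indices(labels, n_per_class=5, rng=rng)
boot_counts = Counter(labels[i] for i in boot)
```

With these weights every class contributes the same total weight (5.0 each
in the toy example), and each bootstrap sample contains the same number of
points per class. The per-sample weights could presumably be handed to any
estimator whose fit method accepts sample weights, and the balanced indices
could be used to build the training set for each tree.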
What other non-linear classifiers in sklearn would be good for this kind of
dataset? It's about 2 million examples in 500+ dimensions.
Thanks,
Sergey
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general