Yeah, there are many useful resources and implementations scattered around the web. For a good, brief overview of the general ideas and concepts, see, for example: http://www.svds.com/learning-imbalanced-classes/
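As a concrete starting point, here is a minimal sketch of one of the options discussed below (class weighting in a tree ensemble). The data, sizes, and parameters are purely illustrative, chosen to roughly match the ratios you described:

```python
# Minimal sketch: class-weighted random forest on an imbalanced problem.
# The synthetic data below is illustrative only.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
# ~10 positives vs ~300 negatives, 5 features (roughly the ratio described)
X = rng.randn(310, 5)
y = np.array([1] * 10 + [0] * 300)

# class_weight='balanced' reweights samples inversely to class frequency,
# so the few positives are not drowned out by the negatives
clf = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                             random_state=0)

# With so few positives, prefer a metric that stays informative under
# imbalance (ROC AUC here) over plain accuracy
scores = cross_val_score(clf, X, y, cv=3, scoring="roc_auc")
print(scores.mean())
```

With only 2-20 positives, cross-validation folds will each contain very few positive samples, so expect noisy estimates; leave-one-out or repeated stratified CV may be worth considering.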
> On Nov 16, 2016, at 3:54 PM, Dale T Smith <[email protected]> wrote:
>
> Unbalanced class classification has been a topic here in past years, and there are posts if you search the archives. There are also plenty of resources available to help you, from actual code on Stackoverflow, to papers that address various ideas. I don't think it's necessary to repeat any of this on the mailing list.
>
> Dale T. Smith | Macy's Systems and Technology | IFS eCom CSE Data Science
> 5985 State Bridge Road, Johns Creek, GA 30097 | [email protected]
>
> From: scikit-learn [mailto:[email protected]] On Behalf Of Fernando Marcos Wittmann
> Sent: Wednesday, November 16, 2016 3:11 PM
> To: Scikit-learn user and developer mailing list
> Subject: Re: [scikit-learn] suggested classification algorithm
>
> Tree-based algorithms (like Random Forest) usually work well for imbalanced datasets. You can also take a look at the SMOTE technique (http://jair.org/media/953/live-953-2037-jair.pdf), which you can use for over-sampling the positive observations.
>
> On Mon, Nov 14, 2016 at 9:14 AM, Thomas Evangelidis <[email protected]> wrote:
>
> Greetings,
>
> I want to design a program that can deal with classification problems of the same type, where the number of positive observations is small but the number of negative ones is much larger. In numbers, the number of positive observations would usually range between 2 and 20, and the number of negative ones could be at least 30 times larger. The number of features could also be between 2 and 20, but that could be reduced using feature selection and elimination algorithms.
> I've read in the documentation that some algorithms, like the SVM, are still effective when the number of dimensions is greater than the number of samples, but I am not sure whether they are suitable for my case. Moreover, according to this figure, Nearest Neighbors is the best and RBF SVM is second:
>
> http://scikit-learn.org/stable/_images/sphx_glr_plot_classifier_comparison_001.png
>
> However, I assume that Nearest Neighbors would not be effective in my case, where the number of positive observations is very low. For these reasons I would like to know your expert opinion about which classification algorithm I should try first.
>
> Thanks in advance,
> Thomas
>
> --
> Thomas Evangelidis
> Research Specialist
> CEITEC - Central European Institute of Technology
> Masaryk University
> Kamenice 5/A35/1S081,
> 62500 Brno, Czech Republic
>
> email: [email protected]
> [email protected]
>
> website: https://sites.google.com/site/thomasevangelidishomepage/
>
> --
> Fernando Marcos Wittmann
> MS Student - Energy Systems Dept.
> School of Electrical and Computer Engineering, FEEC
> University of Campinas, UNICAMP, Brazil
> +55 (19) 987-211302

_______________________________________________
scikit-learn mailing list
[email protected]
https://mail.python.org/mailman/listinfo/scikit-learn
