Yes, there are many useful resources and implementations scattered around the 
web. For a good, brief overview of the general ideas and concepts, see for 
example: http://www.svds.com/learning-imbalanced-classes/ 
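
As a rough starting point, here is a minimal sketch of one commonly used idea 
for this kind of problem: a class-weighted random forest evaluated with 
stratified cross-validation. The data is synthetic and only mimics the class 
ratio described in the thread (20 positives, 600 negatives, 10 features); the 
class means, the number of trees, and the choice of average precision as the 
metric are my own assumptions, not something anyone in the thread prescribed.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (assumption): 20 positives vs. 600 negatives,
# 10 Gaussian features, with the positive class shifted by one unit.
rng = np.random.RandomState(0)
n_pos, n_neg, n_features = 20, 600, 10
X_neg = rng.normal(loc=0.0, scale=1.0, size=(n_neg, n_features))
X_pos = rng.normal(loc=1.0, scale=1.0, size=(n_pos, n_features))
X = np.vstack([X_neg, X_pos])
y = np.concatenate([np.zeros(n_neg, dtype=int), np.ones(n_pos, dtype=int)])

# class_weight="balanced" reweights samples inversely to class frequency,
# so the few positives are not drowned out by the negatives.
clf = RandomForestClassifier(
    n_estimators=200, class_weight="balanced", random_state=0
)

# Stratified folds keep the positive/negative ratio roughly equal per fold,
# which matters when there are only a handful of positives.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Plain accuracy is misleading here (always predicting "negative" already
# scores about 97%), so use a ranking metric focused on the positives.
scores = cross_val_score(clf, X, y, cv=cv, scoring="average_precision")
print("average precision per fold:", np.round(scores, 3))
print("mean average precision:", round(scores.mean(), 3))
```

With only 2 to 20 positives, the number of folds cannot exceed the number of 
positive samples, so n_splits would need to be lowered (or leave-one-out used) 
at the small end of that range.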


> On Nov 16, 2016, at 3:54 PM, Dale T Smith <[email protected]> wrote:
> 
> Unbalanced class classification has been a topic here in past years, and 
> there are posts if you search the archives. There are also plenty of 
> resources available to help you, from actual code on Stackoverflow, to papers 
> that address various ideas. I don’t think it’s necessary to repeat any of 
> this on the mailing list.
>  
>  
> __________________________________________________________________________________________________________________________________________
> Dale T. Smith | Macy's Systems and Technology | IFS eCom CSE Data Science 
> 5985 State Bridge Road, Johns Creek, GA 30097 | [email protected]
>  
> From: scikit-learn 
> [mailto:[email protected]] On Behalf Of 
> Fernando Marcos Wittmann
> Sent: Wednesday, November 16, 2016 3:11 PM
> To: Scikit-learn user and developer mailing list
> Subject: Re: [scikit-learn] suggested classification algorithm
>  
> Tree-based algorithms (like Random Forest) usually work well for imbalanced 
> datasets. You can also take a look at the SMOTE technique 
> (http://jair.org/media/953/live-953-2037-jair.pdf), which you can use for 
> over-sampling the positive observations. 
>  
> On Mon, Nov 14, 2016 at 9:14 AM, Thomas Evangelidis <[email protected]> wrote:
> Greetings,
>  
> I want to design a program that can deal with classification problems of the 
> same type, where the number of positive observations is small but the number 
> of negative observations is much larger. In numbers, the positive 
> observations typically range from 2 to 20, and the negative observations 
> are at least 30 times as numerous. The number of features could also be 
> between 2 and 20, but that could be reduced using feature selection and 
> elimination algorithms. I've read in the documentation that some algorithms, 
> like the SVM, are still effective when the number of dimensions is greater 
> than the number of samples, but I am not sure if they are suitable for my 
> case. Moreover, according to this figure, Nearest Neighbors performs best 
> and the RBF SVM is second:
>  
> http://scikit-learn.org/stable/_images/sphx_glr_plot_classifier_comparison_001.png
>  
> However, I assume that Nearest Neighbors would not be effective in my case, 
> where the number of positive observations is very low. For these reasons, I 
> would like your expert opinion on which classification algorithm I should 
> try first.
>  
> Thanks in advance,
> Thomas
>  
>  
> -- 
> ======================================================================
> Thomas Evangelidis
> Research Specialist
> CEITEC - Central European Institute of Technology
> Masaryk University
> Kamenice 5/A35/1S081, 
> 62500 Brno, Czech Republic 
>  
> email: [email protected]
>           [email protected]
> 
> website: https://sites.google.com/site/thomasevangelidishomepage/
>  
> 
> _______________________________________________
> scikit-learn mailing list
> [email protected]
> https://mail.python.org/mailman/listinfo/scikit-learn
> 
> 
> 
>  
> -- 
> 
> Fernando Marcos Wittmann
> MS Student - Energy Systems Dept. 
> School of Electrical and Computer Engineering, FEEC
> University of Campinas, UNICAMP, Brazil
> +55 (19) 987-211302
>  

