Imbalanced-class classification has come up on this list in past years, and you 
can find those posts by searching the archives. There are also plenty of other 
resources available, from working code on Stack Overflow to papers that 
explore various approaches. I don’t think it’s necessary to repeat any of this on 
the mailing list.


__________________________________________________________________________________________________________________________________________
Dale T. Smith | Macy's Systems and Technology | IFS eCom CSE Data Science
5985 State Bridge Road, Johns Creek, GA 30097 | [email protected]

From: scikit-learn 
[mailto:[email protected]] On Behalf Of 
Fernando Marcos Wittmann
Sent: Wednesday, November 16, 2016 3:11 PM
To: Scikit-learn user and developer mailing list
Subject: Re: [scikit-learn] suggested classification algorithm

Tree-based algorithms (like Random Forest) usually work well for imbalanced 
datasets. You can also take a look at the SMOTE technique 
(http://jair.org/media/953/live-953-2037-jair.pdf) which you can use for 
over-sampling the positive observations.
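
The core idea behind SMOTE-style over-sampling can be sketched in a few lines. This is a minimal, standard-library-only illustration of the interpolation step, not the reference implementation from the paper or any library's API. The function name `smote_like_oversample`, its parameters, and the toy data are all hypothetical and chosen only for this example.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority samples.

    Hypothetical helper for illustration only; `minority` is a list of
    equal-length feature lists belonging to the rare (positive) class.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot use more neighbours than other points
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbours of the base point (excluding itself)
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # random point on the segment between base and the chosen neighbour
        t = rng.random()
        synthetic.append([b + t * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical toy data: 4 positive observations with 2 features each.
positives = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote_like_oversample(positives, n_new=6)
```

Synthetic points should be generated from the training portion only, so that no interpolated copy of a test observation leaks into training. Within scikit-learn itself, `RandomForestClassifier` also accepts `class_weight="balanced"`, which reweights the classes instead of resampling them and may be worth trying alongside over-sampling.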

On Mon, Nov 14, 2016 at 9:14 AM, Thomas Evangelidis 
<[email protected]> wrote:
Greetings,

I want to design a program that can deal with classification problems of the 
same type, where the number of positive observations is small but the number 
of negative observations is much larger. In concrete numbers, the number of 
positive observations would usually range from 2 to 20, and the number of 
negative observations would be at least 30 times larger. The number of features 
could also be between 2 and 20, but that could be reduced using feature 
selection and elimination algorithms. I've read in the documentation that some 
algorithms, like the SVM, are still effective when the number of dimensions is 
greater than the number of samples, but I am not sure if they are suitable for 
my case. Moreover, according to this figure, Nearest Neighbors performs best 
and the RBF SVM second:

http://scikit-learn.org/stable/_images/sphx_glr_plot_classifier_comparison_001.png

However, I assume that Nearest Neighbors would not be effective in my case 
where the number of positive observations is very low. For these reasons I 
would like to know your expert opinion about which classification algorithm 
I should try first.

thanks in advance
Thomas


--

======================================================================

Thomas Evangelidis

Research Specialist
CEITEC - Central European Institute of Technology
Masaryk University
Kamenice 5/A35/1S081,
62500 Brno, Czech Republic


email: [email protected]

          [email protected]

website: https://sites.google.com/site/thomasevangelidishomepage/


_______________________________________________
scikit-learn mailing list
[email protected]
https://mail.python.org/mailman/listinfo/scikit-learn



--

Fernando Marcos Wittmann
MS Student - Energy Systems Dept.
School of Electrical and Computer Engineering, FEEC
University of Campinas, UNICAMP, Brazil
+55 (19) 987-211302
