The problem with your analysis is that it uses only the features and ignores the outcome variable. You may want to look at Nina Zumel and John Mount's work on y-aware PCR and PCA, as well as y-aware feature scaling.
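(A minimal sketch of the y-aware scaling idea from the posts below, for readers of the archive. The function name and the synthetic check are mine, not from the Win-Vector code; this assumes a plain regression setup and features with nonzero variance.)

```python
# Sketch of y-aware feature scaling (after Zumel & Mount): rescale each
# centered feature by the slope of a univariate linear fit of y on that
# feature, so every column is measured in "units of y".
import numpy as np

def y_aware_scale(X, y):
    Xc = X - X.mean(axis=0)                    # center the features
    yc = y - y.mean()                          # center the outcome
    # slope_i = cov(x_i, y) / var(x_i), computed column by column
    slopes = (Xc * yc[:, None]).sum(axis=0) / (Xc ** 2).sum(axis=0)
    return Xc * slopes

# Quick check on synthetic data where y depends only on the first feature:
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 3.0 * X[:, 0] + rng.normal(scale=0.01, size=100)
X_scaled = y_aware_scale(X, y)                 # column 0 now tracks centered y
```

After scaling, features that predict y strongly keep a large spread while uninformative ones shrink toward zero, which is what makes a subsequent PCA "y-aware".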
http://www.win-vector.com/blog/2016/05/pcr_part1_xonly/
http://www.win-vector.com/blog/2016/05/pcr_part2_yaware/
http://www.win-vector.com/blog/2016/06/y-aware-scaling-in-context/

__________________________________________________________________________________________________________________________________________
Dale T. Smith | Macy's Systems and Technology | IFS eCom CSE Data Science
5985 State Bridge Road, Johns Creek, GA 30097 | [email protected]

From: scikit-learn [mailto:[email protected]] On Behalf Of Thomas Evangelidis
Sent: Thursday, November 17, 2016 9:01 AM
To: Scikit-learn user and developer mailing list
Subject: Re: [scikit-learn] suggested classification algorithm

Guys, thank you all for your hints! Practical experience is irreplaceable; that's why I posted this query here. I could read the mailing list archives and the relevant internet resources all week and still not find the key information I can get from someone here.

I did PCA on my training set (24 positive and 1278 negative observations) and projected the 19 features onto the first 2 PCs, which explain 87.6% of the variance in the data. Does this plot help to decide which classification algorithms and/or over- or under-sampling methods would be more suitable?

https://dl.dropboxusercontent.com/u/48168252/PCA_of_features.png

thanks for your advice
Thomas

On 16 November 2016 at 22:20, Sebastian Raschka <[email protected]> wrote:
Yeah, there are many useful resources and implementations scattered around the web. However, a good, brief overview of the general ideas and concepts would be this one, for example:
http://www.svds.com/learning-imbalanced-classes/

> On Nov 16, 2016, at 3:54 PM, Dale T Smith <[email protected]> wrote:
>
> Unbalanced class classification has been a topic here in past years, and
> there are posts if you search the archives.
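(For the archive: a sketch of the projection Thomas describes, run on synthetic data standing in for the real 24-positive / 1278-negative training set; his actual plot is at the Dropbox link above.)

```python
# Project a heavily imbalanced 19-feature dataset onto its first two
# principal components. The data is synthetic (make_classification);
# class sizes mimic the 24/1278 split described in the thread.
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, y = make_classification(n_samples=1302, n_features=19, n_informative=5,
                           weights=[1278 / 1302], random_state=0)
X_std = StandardScaler().fit_transform(X)      # scale before PCA
pca = PCA(n_components=2).fit(X_std)
X_2d = pca.transform(X_std)                    # points to scatter-plot by class
print(pca.explained_variance_ratio_.sum())     # variance captured by PC1 + PC2
```

Coloring `X_2d` by `y` shows whether the minority class separates at all in two dimensions, which is the question the plot is meant to answer.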
> There are also plenty of resources available to help you, from actual code
> on Stack Overflow to papers that address various ideas. I don't think it's
> necessary to repeat any of this on the mailing list.
>
> From: scikit-learn [mailto:[email protected]]
> On Behalf Of Fernando Marcos Wittmann
> Sent: Wednesday, November 16, 2016 3:11 PM
> To: Scikit-learn user and developer mailing list
> Subject: Re: [scikit-learn] suggested classification algorithm
>
> Tree-based algorithms (like Random Forest) usually work well for imbalanced
> datasets. You can also take a look at the SMOTE technique
> (http://jair.org/media/953/live-953-2037-jair.pdf), which you can use to
> over-sample the positive observations.
>
> On Mon, Nov 14, 2016 at 9:14 AM, Thomas Evangelidis <[email protected]> wrote:
> Greetings,
>
> I want to design a program that can deal with classification problems of the
> same type, where the number of positive observations is small but the number
> of negative ones is much larger. In numbers, the positive observations would
> usually range between 2 and 20, and the negatives would be at least 30 times
> more numerous. The number of features could also be between 2 and 20, though
> that could be reduced with feature selection and elimination algorithms. I've
> read in the documentation that some algorithms, like the SVM, are still
> effective when the number of dimensions is greater than the number of
> samples, but I am not sure whether they are suitable for my case.
> Moreover, according to this figure, Nearest Neighbors is the best and
> RBF SVM the second best:
>
> http://scikit-learn.org/stable/_images/sphx_glr_plot_classifier_comparison_001.png
>
> However, I assume that Nearest Neighbors would not be effective in my case,
> where the number of positive observations is very low. For these reasons I
> would like to know your expert opinion on which classification algorithm I
> should try first.
>
> thanks in advance
> Thomas
>
> --
> ======================================================================
> Thomas Evangelidis
> Research Specialist
> CEITEC - Central European Institute of Technology
> Masaryk University
> Kamenice 5/A35/1S081,
> 62500 Brno, Czech Republic
>
> email: [email protected]
>        [email protected]
> website: https://sites.google.com/site/thomasevangelidishomepage/
>
> _______________________________________________
> scikit-learn mailing list
> [email protected]
> https://mail.python.org/mailman/listinfo/scikit-learn
>
> --
> Fernando Marcos Wittmann
> MS Student - Energy Systems Dept.
> School of Electrical and Computer Engineering, FEEC
> University of Campinas, UNICAMP, Brazil
> +55 (19) 987-211302
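(For the archive: the thread's two main suggestions, cost-sensitive weighting and over-sampling, can be sketched with scikit-learn alone. SMOTE itself lives in the third-party imbalanced-learn package, so plain random over-sampling stands in for it here, on synthetic data with an illustrative class split.)

```python
# Two standard responses to heavy class imbalance, using only scikit-learn.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.utils import resample

X, y = make_classification(n_samples=620, n_features=19, weights=[0.97],
                           random_state=0)   # roughly 600 negatives, 20 positives

# Option 1: cost-sensitive learning -- weight classes inversely to frequency,
# so misclassifying a rare positive costs more than misclassifying a negative.
svm = SVC(kernel="rbf", class_weight="balanced").fit(X, y)

# Option 2: random over-sampling of the minority class before fitting.
# (SMOTE would interpolate new synthetic positives instead of duplicating.)
n_neg = int((y == 0).sum())
X_up = resample(X[y == 1], n_samples=n_neg, replace=True, random_state=0)
X_bal = np.vstack([X[y == 0], X_up])
y_bal = np.concatenate([np.zeros(n_neg), np.ones(n_neg)])
svm_bal = SVC(kernel="rbf").fit(X_bal, y_bal)
```

Either way, accuracy is misleading at this imbalance; precision/recall or a precision-recall curve on a held-out set is the more informative check.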
