One problem with the PCA approach is that it doesn't tell you how "discriminative" these features are in more than two dimensions, e.g., to a nonlinear model. In other words, I think it is hard to tell whether the class imbalance is a big problem for this task just from looking at a linear projection and compression of the dataset. Looking at confusion matrices and ROC curves for a few candidate models would be a more direct way to determine whether the imbalance is a challenge for a learning algorithm in the original, higher-dimensional feature space.
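For example, a quick sketch of that kind of check (on synthetic data with a made-up class ratio roughly mimicking the 24-positive / 1278-negative split mentioned below, not on the actual dataset):

```python
# Sketch: out-of-fold evaluation of a classifier on an imbalanced
# synthetic dataset; the data and classifier choice are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.metrics import confusion_matrix, roc_auc_score

# ~2% positives, 19 features, as in the thread's description
X, y = make_classification(n_samples=1302, n_features=19, n_informative=5,
                           weights=[0.98, 0.02], random_state=0)

clf = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                             random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Out-of-fold predicted probabilities for the positive class
proba = cross_val_predict(clf, X, y, cv=cv, method="predict_proba")[:, 1]

print("ROC AUC:", roc_auc_score(y, proba))
print(confusion_matrix(y, proba > 0.5))
```

Stratified folds matter here: with only ~25 positives, unstratified splits can easily leave a fold with no positive observations at all.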
> On Nov 17, 2016, at 9:00 AM, Thomas Evangelidis <[email protected]> wrote:
>
> Guys, thank you all for your hints! Practical experience is irreplaceable;
> that's why I posted this query here. I could read the mailing list archives
> and the respective internet resources all week and still not find the key
> information I could get from someone here.
>
> I did PCA on my training set (24 positive and 1278 negative observations)
> and projected the 19 features onto the first 2 PCs, which explain 87.6% of
> the variance in the data. Does this plot help to decide which
> classification algorithms and/or over- or under-sampling would be more
> suitable?
>
> https://dl.dropboxusercontent.com/u/48168252/PCA_of_features.png
>
> Thanks for your advice,
> Thomas
>
> On 16 November 2016 at 22:20, Sebastian Raschka <[email protected]> wrote:
> Yeah, there are many useful resources and implementations scattered around
> the web. However, a good, brief overview of the general ideas and concepts
> would be this one, for example:
> http://www.svds.com/learning-imbalanced-classes/
>
> > On Nov 16, 2016, at 3:54 PM, Dale T Smith <[email protected]> wrote:
> >
> > Unbalanced class classification has been a topic here in past years, and
> > there are posts if you search the archives. There are also plenty of
> > resources available to help you, from actual code on Stack Overflow to
> > papers that address various ideas. I don't think it's necessary to repeat
> > any of this on the mailing list.
> >
> > Dale T. Smith | Macy's Systems and Technology | IFS eCom CSE Data Science
> > 5985 State Bridge Road, Johns Creek, GA 30097 | [email protected]
> >
> > From: scikit-learn [mailto:[email protected]]
> > On Behalf Of Fernando Marcos Wittmann
> > Sent: Wednesday, November 16, 2016 3:11 PM
> > To: Scikit-learn user and developer mailing list
> > Subject: Re: [scikit-learn] suggested classification algorithm
> >
> > Tree-based algorithms (like Random Forest) usually work well for
> > imbalanced datasets. You can also take a look at the SMOTE technique
> > (http://jair.org/media/953/live-953-2037-jair.pdf), which you can use for
> > over-sampling the positive observations.
> >
> > On Mon, Nov 14, 2016 at 9:14 AM, Thomas Evangelidis <[email protected]> wrote:
> > Greetings,
> >
> > I want to design a program that can deal with classification problems of
> > the same type, where the number of positive observations is small but the
> > number of negative observations is much larger. In numbers, the positive
> > observations would usually range between 2 and 20, and the negative
> > observations would be at least 30 times as many. The number of features
> > could also be between 2 and 20, but that could be reduced using feature
> > selection and elimination algorithms. I've read in the documentation that
> > some algorithms, like the SVM, are still effective when the number of
> > dimensions is greater than the number of samples, but I am not sure if
> > they are suitable for my case. Moreover, according to this figure, Nearest
> > Neighbors performs best and RBF SVM second:
> >
> > http://scikit-learn.org/stable/_images/sphx_glr_plot_classifier_comparison_001.png
> >
> > However, I assume that Nearest Neighbors would not be effective in my
> > case, where the number of positive observations is very low. For these
> > reasons I would like to know your expert opinion about which
> > classification algorithm I should try first.
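(Editorial aside on the SMOTE suggestion above: SMOTE itself lives in the separate imbalanced-learn package, not in scikit-learn. A dependency-free sketch of the simpler idea it builds on, plain random over-sampling of the minority class with scikit-learn's `resample` utility and a made-up feature matrix:)

```python
# Sketch: naive random over-sampling of the minority class.
# SMOTE (imbalanced-learn) goes further by synthesizing new points
# between minority-class neighbors instead of duplicating rows.
import numpy as np
from sklearn.utils import resample

rng = np.random.RandomState(0)
X = rng.randn(1302, 19)              # made-up feature matrix
y = np.array([1] * 24 + [0] * 1278)  # 24 positives, 1278 negatives

X_pos, X_neg = X[y == 1], X[y == 0]

# Draw minority samples with replacement until the classes are balanced
X_pos_up = resample(X_pos, replace=True, n_samples=len(X_neg),
                    random_state=0)

X_bal = np.vstack([X_pos_up, X_neg])
y_bal = np.array([1] * len(X_pos_up) + [0] * len(X_neg))
print(X_bal.shape, np.bincount(y_bal))  # (2556, 19) [1278 1278]
```

Note that over-sampling should be done inside each cross-validation training fold only; duplicating minority rows before splitting leaks copies of the same observation into the test fold.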
> >
> > Thanks in advance,
> > Thomas
> >
> > --
> > Thomas Evangelidis
> > Research Specialist
> > CEITEC - Central European Institute of Technology
> > Masaryk University
> > Kamenice 5/A35/1S081,
> > 62500 Brno, Czech Republic
> > email: [email protected], [email protected]
> > website: https://sites.google.com/site/thomasevangelidishomepage/
> >
> > --
> > Fernando Marcos Wittmann
> > MS Student - Energy Systems Dept.
> > School of Electrical and Computer Engineering, FEEC
> > University of Campinas, UNICAMP, Brazil
>
> <PCA_of_features.png>

_______________________________________________
scikit-learn mailing list
[email protected]
https://mail.python.org/mailman/listinfo/scikit-learn
