Another problem with the PCA approach is that it doesn’t tell you how 
“discriminative” these features are in a higher-dimensional (>2D) space, 
e.g., to a nonlinear model. In other words, I think it is hard to tell 
whether the class imbalance is a big problem for this task just from looking 
at a linear transformation and compression of the dataset. Looking at 
confusion matrices and ROC curves for a few candidate models would be more 
informative for judging whether the class imbalance is a real challenge for a 
learning algorithm in the higher-dimensional space.
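
To make that concrete, here is a rough sketch of the kind of check I have in 
mind, assuming X and y are numpy arrays holding your features and binary 
labels; the RBF SVC with class_weight='balanced' is just a placeholder, any 
classifier with predict_proba or decision_function would do:

    import numpy as np
    from sklearn.model_selection import StratifiedKFold
    from sklearn.svm import SVC
    from sklearn.metrics import confusion_matrix, roc_auc_score

    clf = SVC(kernel='rbf', class_weight='balanced', probability=True)
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

    aucs = []
    for train_idx, test_idx in cv.split(X, y):
        clf.fit(X[train_idx], y[train_idx])
        proba = clf.predict_proba(X[test_idx])[:, 1]
        # the confusion matrix shows how many of the few positives are recovered
        print(confusion_matrix(y[test_idx], clf.predict(X[test_idx])))
        aucs.append(roc_auc_score(y[test_idx], proba))

    print('mean ROC AUC over folds: %.3f' % np.mean(aucs))

With only 24 positives, the stratified folds (rather than plain k-fold) matter; 
otherwise some test folds may not contain any positive example at all.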

> On Nov 17, 2016, at 9:00 AM, Thomas Evangelidis <[email protected]> wrote:
> 
> 
> Guys, thank you all for your hints! Practical experience is irreplaceable; 
> that's why I posted this query here. I could spend all week reading the 
> mailing list archives and the respective internet resources and still not 
> find the key information that someone here can give me.
> 
> I did PCA on my training set (this one has 24 positive and 1278 negative 
> observations) and projected the 19 features onto the first 2 PCs, which 
> explain 87.6% of the variance in the data. Does this plot help to decide 
> which classification algorithms and/or over- or under-sampling methods would 
> be more suitable?
> 
> https://dl.dropboxusercontent.com/u/48168252/PCA_of_features.png
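
Just for reference, a minimal sketch of the kind of 2-component projection you 
describe, assuming X is your 1302 x 19 feature matrix as a numpy array:

    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA

    X_std = StandardScaler().fit_transform(X)   # PCA is sensitive to feature scales
    pca = PCA(n_components=2)
    X_2d = pca.fit_transform(X_std)             # coordinates for the scatter plot
    print(pca.explained_variance_ratio_.sum())  # variance kept by the 2 PCs

Whether or not the features were standardized first changes the projection 
quite a bit, so that is worth noting next to the plot.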
> 
> thanks for your advice
> Thomas
> 
> 
> On 16 November 2016 at 22:20, Sebastian Raschka <[email protected]> wrote:
> Yeah, there are many useful resources and implementations scattered around 
> the web. However, a good, brief overview of the general ideas and concepts 
> is this one, for example: 
> http://www.svds.com/learning-imbalanced-classes/
> 
> 
> > On Nov 16, 2016, at 3:54 PM, Dale T Smith <[email protected]> wrote:
> >
> > Imbalanced-class classification has been a topic here in past years, and 
> > there are posts if you search the archives. There are also plenty of 
> > resources available to help you, from actual code on Stack Overflow to 
> > papers that address various ideas. I don’t think it’s necessary to repeat 
> > any of this on the mailing list.
> >
> >
> > __________________________________________________________________________________________________________________________________________
> > Dale T. Smith | Macy's Systems and Technology | IFS eCom CSE Data Science
> > 5985 State Bridge Road, Johns Creek, GA 30097 | [email protected]
> >
> > From: scikit-learn 
> > [mailto:[email protected]] On Behalf 
> > Of Fernando Marcos Wittmann
> > Sent: Wednesday, November 16, 2016 3:11 PM
> > To: Scikit-learn user and developer mailing list
> > Subject: Re: [scikit-learn] suggested classification algorithm
> >
> > Tree-based algorithms (like Random Forest) usually work well for 
> > imbalanced datasets. You can also take a look at the SMOTE technique 
> > (http://jair.org/media/953/live-953-2037-jair.pdf), which you can use to 
> > over-sample the positive observations.
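
A minimal usage sketch with the imbalanced-learn package (my assumption here 
is that you would use imblearn rather than a hand-rolled SMOTE; X and y are 
the feature matrix and labels, and the resampling method is called 
fit_resample or fit_sample depending on the imblearn version):

    from imblearn.over_sampling import SMOTE  # pip install imbalanced-learn

    # k_neighbors must be smaller than the number of minority samples, which
    # matters with only ~24 (or as few as 2-20) positive observations
    sm = SMOTE(k_neighbors=5, random_state=0)
    X_res, y_res = sm.fit_resample(X, y)  # resample the training fold only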
> >
> > On Mon, Nov 14, 2016 at 9:14 AM, Thomas Evangelidis <[email protected]> 
> > wrote:
> > Greetings,
> >
> > I want to design a program that can deal with classification problems of 
> > the same type, where the number of positive observations is small but the 
> > number of negative observations is much larger. To put numbers on it, the 
> > number of positive observations would usually range between 2 and 20, and 
> > the number of negative observations could be at least 30 times larger. The 
> > number of features could be between 2 and 20 too, but that could be 
> > reduced using feature selection and elimination algorithms. I've read in 
> > the documentation that some algorithms like the SVM are still effective 
> > when the number of dimensions is greater than the number of samples, but 
> > I am not sure whether they are suitable for my case. Moreover, according 
> > to this figure, Nearest Neighbors is the best and the RBF SVM is second:
> >
> > http://scikit-learn.org/stable/_images/sphx_glr_plot_classifier_comparison_001.png
> >
> > However, I assume that Nearest Neighbors would not be effective in my 
> > case, where the number of positive observations is very low. For these 
> > reasons I would like to know your expert opinion about which 
> > classification algorithm I should try first.
> >
> > thanks in advance
> > Thomas
> >
> >
> > --
> > ======================================================================
> > Thomas Evangelidis
> > Research Specialist
> > CEITEC - Central European Institute of Technology
> > Masaryk University
> > Kamenice 5/A35/1S081,
> > 62500 Brno, Czech Republic
> >
> > email: [email protected]
> >           [email protected]
> >
> > website: https://sites.google.com/site/thomasevangelidishomepage/
> >
> >
> >
> > --
> >
> > Fernando Marcos Wittmann
> > MS Student - Energy Systems Dept.
> > School of Electrical and Computer Engineering, FEEC
> > University of Campinas, UNICAMP, Brazil
> > +55 (19) 987-211302
> >
> 
> 
> 
> 
> -- 
> ======================================================================
> Thomas Evangelidis
> Research Specialist
> CEITEC - Central European Institute of Technology
> Masaryk University
> Kamenice 5/A35/1S081, 
> 62500 Brno, Czech Republic 
> 
> email: [email protected]
>               [email protected]
> 
> website: https://sites.google.com/site/thomasevangelidishomepage/
> 
> 
> <PCA_of_features.png>
