The problem with your analysis is that it uses only the features and ignores the outcome variable. You may want to look at Nina Zumel and John Mount's work on y-aware PCR and PCA, as well as y-aware feature scaling.
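(A minimal sketch of the y-aware scaling idea from the posts below, for readers of the archive. The function name and the synthetic check are mine, not from the Win-Vector code; this assumes a plain regression setup and features with nonzero variance.)

```python
# Sketch of y-aware feature scaling (after Zumel & Mount): rescale each
# centered feature by the slope of a univariate linear fit of y on that
# feature, so every column is measured in "units of y".
import numpy as np

def y_aware_scale(X, y):
    Xc = X - X.mean(axis=0)                    # center the features
    yc = y - y.mean()                          # center the outcome
    # slope_i = cov(x_i, y) / var(x_i), computed column by column
    slopes = (Xc * yc[:, None]).sum(axis=0) / (Xc ** 2).sum(axis=0)
    return Xc * slopes

# Quick check on synthetic data where y depends only on the first feature:
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 3.0 * X[:, 0] + rng.normal(scale=0.01, size=100)
X_scaled = y_aware_scale(X, y)                 # column 0 now tracks centered y
```

After scaling, features that predict y strongly keep a large spread while uninformative ones shrink toward zero, which is what makes a subsequent PCA "y-aware".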
http://www.win-vector.com/blog/2016/05/pcr_part1_xonly/
http://www.win-vector.com/blog/2016/05/pcr_part2_yaware/
http://www.win-vector.com/blog/2016/06/y-aware-scaling-in-context/

__________________________________________________________________________________________________________________________________________
Dale T. Smith | Macy's Systems and Technology | IFS eCom CSE Data Science
5985 State Bridge Road, Johns Creek, GA 30097 | [email protected]

From: scikit-learn [mailto:[email protected]] On Behalf Of Thomas Evangelidis
Sent: Thursday, November 17, 2016 9:01 AM
To: Scikit-learn user and developer mailing list
Subject: Re: [scikit-learn] suggested classification algorithm

Guys, thank you all for your hints! Practical experience is irreplaceable; that's why I posted this query here. I could read the mailing list archives and the relevant internet resources all week and still not find the key information I can get from someone here.

I did PCA on my training set (24 positive and 1278 negative observations) and projected the 19 features onto the first 2 PCs, which explain 87.6% of the variance in the data. Does this plot help to decide which classification algorithms and/or over- or under-sampling methods would be more suitable?

https://dl.dropboxusercontent.com/u/48168252/PCA_of_features.png

thanks for your advice
Thomas

On 16 November 2016 at 22:20, Sebastian Raschka <[email protected]> wrote:
Yeah, there are many useful resources and implementations scattered around the web. However, a good, brief overview of the general ideas and concepts would be this one, for example:
http://www.svds.com/learning-imbalanced-classes/

> On Nov 16, 2016, at 3:54 PM, Dale T Smith <[email protected]> wrote:
>
> Unbalanced class classification has been a topic here in past years, and
> there are posts if you search the archives.
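(For the archive: a sketch of the projection Thomas describes, run on synthetic data standing in for the real 24-positive / 1278-negative training set; his actual plot is at the Dropbox link above.)

```python
# Project a heavily imbalanced 19-feature dataset onto its first two
# principal components. The data is synthetic (make_classification);
# class sizes mimic the 24/1278 split described in the thread.
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, y = make_classification(n_samples=1302, n_features=19, n_informative=5,
                           weights=[1278 / 1302], random_state=0)
X_std = StandardScaler().fit_transform(X)      # scale before PCA
pca = PCA(n_components=2).fit(X_std)
X_2d = pca.transform(X_std)                    # points to scatter-plot by class
print(pca.explained_variance_ratio_.sum())     # variance captured by PC1 + PC2
```

Coloring `X_2d` by `y` shows whether the minority class separates at all in two dimensions, which is the question the plot is meant to answer.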
> There are also plenty of resources available to help you, from actual code
> on Stack Overflow to papers that address various ideas. I don't think it's
> necessary to repeat any of this on the mailing list.
>
> From: scikit-learn [mailto:[email protected]]
> On Behalf Of Fernando Marcos Wittmann
> Sent: Wednesday, November 16, 2016 3:11 PM
> To: Scikit-learn user and developer mailing list
> Subject: Re: [scikit-learn] suggested classification algorithm
>
> Tree-based algorithms (like Random Forest) usually work well for imbalanced
> datasets. You can also take a look at the SMOTE technique
> (http://jair.org/media/953/live-953-2037-jair.pdf), which you can use to
> over-sample the positive observations.
>
> On Mon, Nov 14, 2016 at 9:14 AM, Thomas Evangelidis <[email protected]> wrote:
> Greetings,
>
> I want to design a program that can deal with classification problems of the
> same type, where the number of positive observations is small but the number
> of negative ones is much larger. In numbers, the positive observations would
> usually range between 2 and 20, and the negatives would be at least 30 times
> more numerous. The number of features could also be between 2 and 20, though
> that could be reduced with feature selection and elimination algorithms. I've
> read in the documentation that some algorithms, like the SVM, are still
> effective when the number of dimensions is greater than the number of
> samples, but I am not sure whether they are suitable for my case.
> Moreover, according to this figure, Nearest Neighbors is the best and
> RBF SVM the second best:
>
> http://scikit-learn.org/stable/_images/sphx_glr_plot_classifier_comparison_001.png
>
> However, I assume that Nearest Neighbors would not be effective in my case,
> where the number of positive observations is very low. For these reasons I
> would like to know your expert opinion on which classification algorithm I
> should try first.
>
> thanks in advance
> Thomas
>
> --
> ======================================================================
> Thomas Evangelidis
> Research Specialist
> CEITEC - Central European Institute of Technology
> Masaryk University
> Kamenice 5/A35/1S081,
> 62500 Brno, Czech Republic
>
> email: [email protected]
>        [email protected]
> website: https://sites.google.com/site/thomasevangelidishomepage/
>
> _______________________________________________
> scikit-learn mailing list
> [email protected]
> https://mail.python.org/mailman/listinfo/scikit-learn
>
> --
> Fernando Marcos Wittmann
> MS Student - Energy Systems Dept.
> School of Electrical and Computer Engineering, FEEC
> University of Campinas, UNICAMP, Brazil
> +55 (19) 987-211302
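(For the archive: the thread's two main suggestions, cost-sensitive weighting and over-sampling, can be sketched with scikit-learn alone. SMOTE itself lives in the third-party imbalanced-learn package, so plain random over-sampling stands in for it here, on synthetic data with an illustrative class split.)

```python
# Two standard responses to heavy class imbalance, using only scikit-learn.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.utils import resample

X, y = make_classification(n_samples=620, n_features=19, weights=[0.97],
                           random_state=0)   # roughly 600 negatives, 20 positives

# Option 1: cost-sensitive learning -- weight classes inversely to frequency,
# so misclassifying a rare positive costs more than misclassifying a negative.
svm = SVC(kernel="rbf", class_weight="balanced").fit(X, y)

# Option 2: random over-sampling of the minority class before fitting.
# (SMOTE would interpolate new synthetic positives instead of duplicating.)
n_neg = int((y == 0).sum())
X_up = resample(X[y == 1], n_samples=n_neg, replace=True, random_state=0)
X_bal = np.vstack([X[y == 0], X_up])
y_bal = np.concatenate([np.zeros(n_neg), np.ones(n_neg)])
svm_bal = SVC(kernel="rbf").fit(X_bal, y_bal)
```

Either way, accuracy is misleading at this imbalance; precision/recall or a precision-recall curve on a held-out set is the more informative check.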
