Hi J.B. and the list,

Please accept my apologies for the much-delayed response; I was ill for the last few days and did not have access to my email. Thank you for your detailed reply.
> What does "performing better" mean in this case? How are you defining
> performance? A particular metric such as MCC, PPV, or NPV?

I was looking at precision-recall. I have a huge class imbalance (the positive class is much smaller than the negative class), so I am testing the performance of the classifiers with an increasing negative-set size (each time randomly selecting a larger negative set). SVC performs better in precision-recall space: its precision-recall curve lies above the RFC curve.

Two issues shape my goal:

1. I have a major class imbalance.
2. Some of my positive observations are tightly packed within negative clusters (seen in 2-D PCA and t-SNE plots).

My aim is to obtain a very clean set of positive predictions; as a trade-off, I am happy to sacrifice some of the positive observations.

> Also, how is the cross-validation being done - is the data shuffled
> before the train/test groups are created? Is the exact same split of
> training and test data per fold used for both SVC and RF?

I am currently testing this. Thanks for the suggestion.

> Normalizing your continuous values seems quite fine, but consider these
> aspects:
> --Does it make sense in the domain of your problem to Z-normalize the
> integral (integer-valued) descriptors/features?
> --For the integral values, would subtracting about the median value make
> more sense? This is similar to the previous consideration.

You are right: Z-score normalisation of the integer counts does not make much sense. Thanks for pointing it out; I am currently testing the median-centring alternative.

> --What happens to SVC if you don't normalise?

SVC performs quite badly without normalisation.

> --What happens to RF if you do normalise?

This is interesting. My understanding was that decision-tree-based algorithms do not require normalised data. Following your suggestion, I tested an RFC with and without normalised data, and the results (confusion matrices at the 0.5 operating point) appear to be identical. That felt odd to me.
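On the shuffling / identical-splits point, this is roughly how I am setting up the comparison now: a single StratifiedKFold with a fixed seed is reused for both models, and a ColumnTransformer z-scores only the non-binary columns inside each fold, so the test portion never leaks into the scaler. The data set below is a toy one, and the column indices are placeholders standing in for my real feature matrix:

```python
# Sketch: one shuffled, stratified split shared by both models, with
# z-scoring applied only to the (placeholder) continuous columns.
from sklearn.compose import ColumnTransformer
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy imbalanced data: ~90% negative, ~10% positive (my real set is more extreme).
X, y = make_classification(n_samples=600, n_features=20,
                           weights=[0.9, 0.1], random_state=0)
continuous_cols = list(range(16, 20))  # placeholder indices for the non-binary features

# Scale only the continuous columns; the binary columns pass through untouched.
scaler = ColumnTransformer([("z", StandardScaler(), continuous_cols)],
                           remainder="passthrough")

# One CV object with a fixed seed: both models see the exact same folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

models = {
    "SVC": Pipeline([("scale", scaler),
                     ("clf", SVC(probability=True, random_state=0))]),
    "RF": RandomForestClassifier(n_estimators=200, random_state=0),
}

ap = {}
for name, model in models.items():
    proba = cross_val_predict(model, X, y, cv=cv, method="predict_proba")[:, 1]
    ap[name] = average_precision_score(y, proba)  # single-number PR-curve summary
    print(f"{name}: average precision = {ap[name]:.3f}")
```

Average precision summarises the precision-recall curve in one number, which seems to suit my clean-positive-predictions goal better than accuracy under this imbalance.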
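On the identical confusion matrices, I wrote a small sketch to convince myself this is expected rather than a bug: tree splits depend only on the ordering of values within each feature, and z-scoring is a strictly monotonic per-feature transform, so with a fixed random seed the forest should learn the same partitions either way. The toy data here (and its 90/10 imbalance) is made up purely for illustration:

```python
# Sketch: a random forest is invariant to per-feature monotonic transforms
# such as z-scoring. With the same random_state, the bootstrap samples and
# feature subsampling match, and each candidate split partitions the same
# samples, so the fitted trees - and hence the predictions and the confusion
# matrix - should come out identical.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.9, 0.1], random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# Same seed for both fits, so the only difference is the feature scaling.
rf_raw = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
rf_scaled = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_scaled, y)

same = np.array_equal(rf_raw.predict(X), rf_scaled.predict(X_scaled))
print(same)
```

So the identical matrices look like a property of the model rather than an error in my pipeline.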
So far I have only tested this on a small data set; I am currently running it on other data sets to see whether the behaviour persists. Would you have expected this?

> If you can provide a good amount of concrete data to present along with
> your "problem", this community is excellent at providing intelligent,
> helpful responses.

Thanks a lot for the suggestion. I will put together some example data sets and results from the current analysis and post them as soon as possible.

Thanks in advance for your help.

Regards,
Mamun

> Date: Mon, 8 May 2017 21:48:28 +0900
> From: "Brown J.B." <jbbr...@kuhp.kyoto-u.ac.jp>
>
> Dear Mamun,
>
> >> A. 80% features are binary [ 0 or 1 ]
> >> B. 10% are integer values representing counts / occurrences.
> >> C. 10% are continuous values between different ranges.
> >>
> >> My prior understanding was that decision tree based algorithms work
> >> better on mixed data types. In this particular case I am noticing
> >> SVC is performing much better than Random forest.
>
> What does "performing better" mean in this case?
> How are you defining performance?
> A particular metric such as MCC, PPV, or NPV?
>
> Also, how is the cross-validation being done - is the data shuffled
> before the train/test groups are created?
> Is the exact same split of training and test data per fold used for both
> SVC and RF?
>
> >> I Z-score normalise the data before I send it to the support vector
> >> classifier.
> >> - Binary features ( type A ) are left as they are.
> >> - Integer and continuous features are Z-score normalised
> >>   [ ( feat - mean(feat) ) / sd(feat) ].
>
> Normalizing your continuous values seems quite fine, but consider these
> aspects:
> --Does it make sense in the domain of your problem to Z-normalize the
> integral (integer-valued) descriptors/features?
> --For the integral values, would subtracting about the median value make
> more sense? This is similar to the previous consideration.
> --What happens to SVC if you don't normalize?
> --What happens to RF if you do normalize?
>
> While my various comments above are all geared toward empirical aspects and
> not toward theoretical aspects, picking some of them to explore is likely
> to help you gain practical insight on your situation/inquiry.
> I'm sure you already know this, but while machine learning may have some
> "practical guidelines for best practices", they are guidelines and not hard
> rules.
> So, again, I would recommend doing some more empirical tests and
> re-evaluating your situation once you have new data in hand.
>
> If you can provide a good amount of concrete data to present along with
> your "problem", this community is excellent at providing intelligent,
> helpful responses.
>
> Hope this helps.
>
> J.B.

> Date: Mon, 8 May 2017 10:45:26 +0100
> From: Mamun Rashid <mamunbabu2...@gmail.com>
> Subject: [scikit-learn] SVC data normalisation
>
> Hi All,
> I am testing two classifiers [ 1. Random forest 2. SVC with radial basis
> kernel ] on a data set via 5-fold cross-validation.
>
> The feature matrix contains:
>
> A. 80% features are binary [ 0 or 1 ]
> B. 10% are integer values representing counts / occurrences.
> C. 10% are continuous values between different ranges.
>
> My prior understanding was that decision tree based algorithms work
> better on mixed data types.
> In this particular case I am noticing SVC is performing much better than
> Random forest.
>
> I Z-score normalise the data before I send it to the support vector
> classifier.
> - Binary features ( type A ) are left as they are.
> - Integer and continuous features are Z-score normalised
>   [ ( feat - mean(feat) ) / sd(feat) ].
>
> I was wondering if anyone can tell me whether this normalisation approach
> is correct for an SVC run.
>
> Thanks in advance for your help.
>
> Regards,
> Mamun
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn