Hi J.B. and the list,

Please accept my apologies for the much-delayed response; I was ill for the last few days and did not have access to my email. Thank you for your detailed reply.
> What does "performing better" mean in this case? How are you defining
> performance? A particular metric such as MCC, PPV, or NPV?

I was looking at precision-recall. I have a huge class imbalance (the positive class is much smaller than the negative class), so I am testing the performance of the classifiers with an increasing negative-set size (each time randomly selecting a larger negative set). SVC performs better in precision-recall space: its precision-recall curve lies above the RFC curve.

Two issues shape my goal:

1. I have a major class imbalance.
2. Some of my positive observations are tightly packed within negative clusters (seen in 2-D PCA and t-SNE plots).

My aim is to obtain a very clean set of positive predictions; as a trade-off, I am happy to sacrifice some of the positive observations.

> Also, how is the cross-validation being done - is the data shuffled
> before the train/test groups are created? Is the exact same split of
> training and test data per fold used for both SVC and RF?

I am currently testing this. Thanks for the suggestion.

> Normalizing your continuous values seems quite fine, but consider these
> aspects:
> --Does it make sense in the domain of your problem to Z-normalize the
> integral (integer-valued) descriptors/features?
> --For the integral values, would subtracting about the median value make
> more sense? This is similar to the previous consideration.

You are right: Z-score normalisation of the integer counts does not make much sense. Thanks for pointing it out; I am currently testing the median-centring alternative.

> --What happens to SVC if you don't normalise?

SVC performs quite badly without normalisation.

> --What happens to RF if you do normalise?

This is interesting. My understanding was that decision-tree-based algorithms do not require normalised data. Following your suggestion, I tested an RFC with and without normalised data, and the results (confusion matrices at the 0.5 operating point) appear to be identical. That felt odd to me.
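On the shuffling / identical-splits point, this is roughly how I am setting up the comparison now: a single StratifiedKFold with a fixed seed is reused for both models, and a ColumnTransformer z-scores only the non-binary columns inside each fold, so the test portion never leaks into the scaler. The data set below is a toy one, and the column indices are placeholders standing in for my real feature matrix:

```python
# Sketch: one shuffled, stratified split shared by both models, with
# z-scoring applied only to the (placeholder) continuous columns.
from sklearn.compose import ColumnTransformer
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy imbalanced data: ~90% negative, ~10% positive (my real set is more extreme).
X, y = make_classification(n_samples=600, n_features=20,
                           weights=[0.9, 0.1], random_state=0)
continuous_cols = list(range(16, 20))  # placeholder indices for the non-binary features

# Scale only the continuous columns; the binary columns pass through untouched.
scaler = ColumnTransformer([("z", StandardScaler(), continuous_cols)],
                           remainder="passthrough")

# One CV object with a fixed seed: both models see the exact same folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

models = {
    "SVC": Pipeline([("scale", scaler),
                     ("clf", SVC(probability=True, random_state=0))]),
    "RF": RandomForestClassifier(n_estimators=200, random_state=0),
}

ap = {}
for name, model in models.items():
    proba = cross_val_predict(model, X, y, cv=cv, method="predict_proba")[:, 1]
    ap[name] = average_precision_score(y, proba)  # single-number PR-curve summary
    print(f"{name}: average precision = {ap[name]:.3f}")
```

Average precision summarises the precision-recall curve in one number, which seems to suit my clean-positive-predictions goal better than accuracy under this imbalance.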
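On the identical confusion matrices, I wrote a small sketch to convince myself this is expected rather than a bug: tree splits depend only on the ordering of values within each feature, and z-scoring is a strictly monotonic per-feature transform, so with a fixed random seed the forest should learn the same partitions either way. The toy data here (and its 90/10 imbalance) is made up purely for illustration:

```python
# Sketch: a random forest is invariant to per-feature monotonic transforms
# such as z-scoring. With the same random_state, the bootstrap samples and
# feature subsampling match, and each candidate split partitions the same
# samples, so the fitted trees - and hence the predictions and the confusion
# matrix - should come out identical.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.9, 0.1], random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# Same seed for both fits, so the only difference is the feature scaling.
rf_raw = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
rf_scaled = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_scaled, y)

same = np.array_equal(rf_raw.predict(X), rf_scaled.predict(X_scaled))
print(same)
```

So the identical matrices look like a property of the model rather than an error in my pipeline.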
So far I have only tested this on a small data set; I am currently running it on other data sets to see whether the behaviour persists. Would you have expected this?

> If you can provide a good amount of concrete data to present along with
> your "problem", this community is excellent at providing intelligent,
> helpful responses.

Thanks a lot for the suggestion. I will put together some example data sets and results from the current analysis and post them as soon as possible.

Thanks in advance for your help.

Regards,
Mamun

> Date: Mon, 8 May 2017 21:48:28 +0900
> From: "Brown J.B." <jbbr...@kuhp.kyoto-u.ac.jp>
>
> Dear Mamun,
>
> >> A. 80% features are binary [ 0 or 1 ]
> >> B. 10% are integer values representing counts / occurrences.
> >> C. 10% are continuous values between different ranges.
> >>
> >> My prior understanding was that decision tree based algorithms work
> >> better on mixed data types. In this particular case I am noticing
> >> SVC is performing much better than Random forest.
>
> What does "performing better" mean in this case?
> How are you defining performance?
> A particular metric such as MCC, PPV, or NPV?
>
> Also, how is the cross-validation being done - is the data shuffled
> before the train/test groups are created?
> Is the exact same split of training and test data per fold used for both
> SVC and RF?
>
> >> I Z-score normalise the data before I send it to the support vector
> >> classifier.
> >> - Binary features ( type A ) are left as they are.
> >> - Integer and continuous features are Z-score normalised
> >>   [ ( feat - mean(feat) ) / sd(feat) ].
>
> Normalizing your continuous values seems quite fine, but consider these
> aspects:
> --Does it make sense in the domain of your problem to Z-normalize the
> integral (integer-valued) descriptors/features?
> --For the integral values, would subtracting about the median value make
> more sense? This is similar to the previous consideration.
> --What happens to SVC if you don't normalize?
> --What happens to RF if you do normalize?
>
> While my various comments above are all geared toward empirical aspects and
> not toward theoretical aspects, picking some of them to explore is likely
> to help you gain practical insight on your situation/inquiry.
> I'm sure you already know this, but while machine learning may have some
> "practical guidelines for best practices", they are guidelines and not hard
> rules.
> So, again, I would recommend doing some more empirical tests and
> re-evaluating your situation once you have new data in hand.
>
> If you can provide a good amount of concrete data to present along with
> your "problem", this community is excellent at providing intelligent,
> helpful responses.
>
> Hope this helps.
>
> J.B.

> Date: Mon, 8 May 2017 10:45:26 +0100
> From: Mamun Rashid <mamunbabu2...@gmail.com>
> Subject: [scikit-learn] SVC data normalisation
>
> Hi All,
> I am testing two classifiers [ 1. Random forest 2. SVC with radial basis
> kernel ] on a data set via 5-fold cross-validation.
>
> The feature matrix contains:
>
> A. 80% features are binary [ 0 or 1 ]
> B. 10% are integer values representing counts / occurrences.
> C. 10% are continuous values between different ranges.
>
> My prior understanding was that decision tree based algorithms work
> better on mixed data types.
> In this particular case I am noticing SVC is performing much better than
> Random forest.
>
> I Z-score normalise the data before I send it to the support vector
> classifier.
> - Binary features ( type A ) are left as they are.
> - Integer and continuous features are Z-score normalised
>   [ ( feat - mean(feat) ) / sd(feat) ].
>
> I was wondering if anyone can tell me whether this normalisation approach
> is correct for an SVC run.
>
> Thanks in advance for your help.
>
> Regards,
> Mamun
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn