Dear R-helpers,

I'm working on mass spectra with randomForest in R. Following the recommendations for data with many noisy variables, I don't want to use the default mtry (the square root of the number of variables), but I'm not sure up to which proportion mtry/nvariables it makes sense to increase mtry without "overtuning" the forest.
Here is my example: I have 106 spectra belonging to 4 classes, and the number of variables is 172. I'm interested in finding information about the variables (importance, split points, etc.) and about proximities.
First I ran a forest with mtry=30 and ntree=2500. The result was an OOB estimate of the overall error rate of zero, i.e. perfect classification. To explore my results, I calculated the average proximity between the classes. I got:
> res
           op12      op13       op14       op23      op24       op34
[1,] 0.06145473 0.1369406 0.08036264 0.06171053 0.1113126 0.06732087
For me, the important point in these values is that classes 1 and 3, as well as classes 2 and 4, share more common features than the other class pairs do. I have already worked a lot with these data and looked at my spectra closely, and I believe these proximities to be realistic.
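For reference, the average between-class proximities above can be computed roughly as in the following sketch. The data here are simulated stand-ins with the same dimensions as my spectra (106 samples, 172 variables, 4 classes), and the helper name mean_prox is just made up for this illustration. It also assumes the proximities are taken over all cases (the randomForest default) rather than out-of-bag only.

```r
library(randomForest)

set.seed(1)
# Hypothetical stand-in data: 106 samples, 172 variables, 4 classes
n <- 106; p <- 172
cls <- factor(rep(1:4, length.out = n))
x <- matrix(rnorm(n * p), n, p)
# Shift one variable per class so the classes carry some signal
x[, 1:4] <- x[, 1:4] + 1.5 * model.matrix(~ cls - 1)

# Forest with the settings from the post; proximity = TRUE stores the
# n x n proximity matrix in rf$proximity
rf <- randomForest(x, cls, mtry = 30, ntree = 2500, proximity = TRUE)

# Hypothetical helper: mean proximity over all pairs of cases (i, j)
# with case i in class a and case j in class b, for every class pair a < b
mean_prox <- function(prox, cls) {
  lev <- levels(cls)
  pairs <- combn(length(lev), 2)
  res <- apply(pairs, 2, function(ij) {
    mean(prox[cls == lev[ij[1]], cls == lev[ij[2]], drop = FALSE])
  })
  names(res) <- paste0("op", pairs[1, ], pairs[2, ])
  res
}

res <- mean_prox(rf$proximity, cls)
print(res)
```

Because the forest is random, these averages vary somewhat from run to run, so repeating the fit with a few different seeds gives an idea of how much of a difference between two forests is just noise.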


Then I ran the tuneRF function (stepFactor=1.5) and got mtry=63. A new forest with this mtry and 2500 trees gave me perfect classification as well, but the relationship between the proximity values changed a lot:
> res
          op12     op13       op14       op23       op24       op34
[1,] 0.1092702 0.117489 0.09696328 0.08725208 0.08495621 0.06506148
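The tuning step can be sketched as follows. Again the data are simulated stand-ins rather than my spectra, and apart from stepFactor=1.5 the tuneRF settings shown (ntreeTry, improve, the default mtryStart) are assumptions for illustration, not necessarily what I used.

```r
library(randomForest)

set.seed(2)
# Hypothetical stand-in data: 106 samples, 172 variables, 4 classes
n <- 106; p <- 172
cls <- factor(rep(1:4, length.out = n))
x <- matrix(rnorm(n * p), n, p)
x[, 1:4] <- x[, 1:4] + 1.5 * model.matrix(~ cls - 1)

# Search over mtry, multiplying/dividing by stepFactor at each step;
# the result is a matrix with columns "mtry" and "OOBError"
tuned <- tuneRF(x, cls, stepFactor = 1.5, ntreeTry = 200,
                improve = 0.05, trace = FALSE, plot = FALSE)

# Pick the mtry with the lowest OOB error among those tried
best_mtry <- tuned[which.min(tuned[, "OOBError"]), "mtry"]

# Refit a larger forest with the selected mtry, keeping proximities
rf2 <- randomForest(x, cls, mtry = best_mtry, ntree = 2500, proximity = TRUE)
print(tuned)
```

Note that tuneRF selects mtry on OOB error alone, so two values of mtry with the same (here zero) error rate can still give forests with different proximity structure.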


This is what makes me think that I have overtuned my second forest. So how should I choose mtry?

Best regards,
Ute

______________________________________________
[EMAIL PROTECTED] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
