Prof. Jeffrey Cardille wrote:
> Hello,
>
> Is there an accepted way to convey, for regression trees, something
> akin to R-squared?
>
> I'm developing regression trees for a continuous y variable and I'd
> like to say how well they are doing. In particular, I'm analyzing the
> results of a simulation model having highly non-linear behavior, and
> asking what characteristics of the inputs are related to a particular
> output measure. I've got a very large number of points: n=4000. I'm
> not able to do a model sensitivity analysis because of the large
> number of inputs and the model run time.
>
> I've been googling around both on the archives and on the rest of the
> web for several hours, but I'm still having trouble getting a firm
> sense of the state of the art. Could someone help me to quickly
> understand what strategy, if any, is acceptable to say something like
> "The regression tree in Figure 3 captures 42% of the variance"? The
> target audience is readers who will be interested in the subsequent
> verbal explanation of the relationship, but only once they are
> comfortable that the tree really does capture something. I've run
> across methods to say how well a tree does relative to a set of trees
> on the same data, but that doesn't help much unless I'm sure the
> trees in question are really capturing the essence of the system.
>
> I'm happy to be pointed to a web site or to a thread I may have
> missed that answers this exact question.
>
> Thanks very much,
>
> Jeff
>
> ------------------------------------------
> Prof. Jeffrey Cardille
> [EMAIL PROTECTED]
> R-help@stat.math.ethz.ch mailing list

Ye (below) has a method to get a nearly unbiased estimate of R^2 from
recursive partitioning. In his examples the result was similar to
using the formula for adjusted R^2 with regression degrees of freedom
equal to about 3n/4. You can also use something like 10-fold
cross-validation repeated 20 times to get a fairly precise and
unbiased estimate of R^2.

Frank

@ARTICLE{ye98mea,
  author  = {Ye, Jianming},
  year    = 1998,
  title   = {On measuring and correcting the effects of data mining and model selection},
  journal = JASA,
  volume  = 93,
  pages   = {120-131},
  annote  = {generalized degrees of freedom;GDF;effective degrees of freedom;data mining;model selection;model uncertainty;overfitting;nonparametric regression;CART;simulation setup}
}

-- 
Frank E Harrell Jr   Professor and Chair           School of Medicine
                     Department of Biostatistics   Vanderbilt University
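
For readers who want to see the two suggestions in the reply spelled out, here is a rough sketch in Python using only the standard library. It is not Ye's generalized degrees of freedom procedure; it only shows the usual adjusted R^2 formula with the regression degrees of freedom p set to about 3n/4, and one common way to compute a repeated k-fold cross-validated R^2 (1 minus out-of-fold SSE over total SS). The names adjusted_r2, repeated_cv_r2 and fit_stump are made up for illustration, and the one-split "tree" is a toy stand-in for a real regression tree fit.

```python
import random


def adjusted_r2(r2, n, p):
    """Usual adjusted R^2 with p regression degrees of freedom."""
    if n - p - 1 <= 0:
        raise ValueError("need n - p - 1 > 0")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


def repeated_cv_r2(x, y, fit, k=10, repeats=20, seed=0):
    """Average over `repeats` random k-fold splits of 1 - SSE_out_of_fold / SST.

    fit(x_train, y_train) must return a function mapping one x to a prediction.
    """
    rng = random.Random(seed)
    n = len(y)
    y_bar = sum(y) / n
    sst = sum((yi - y_bar) ** 2 for yi in y)
    estimates = []
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        folds = [idx[i::k] for i in range(k)]
        pred = [0.0] * n
        for fold in folds:
            held_out = set(fold)
            train = [i for i in idx if i not in held_out]
            model = fit([x[i] for i in train], [y[i] for i in train])
            for i in fold:
                pred[i] = model(x[i])  # prediction made without seeing point i
        sse = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
        estimates.append(1.0 - sse / sst)
    return sum(estimates) / len(estimates)


def fit_stump(x_train, y_train):
    """Toy one-split regression tree on one numeric input (assumes distinct x)."""
    pairs = sorted(zip(x_train, y_train))
    best = None
    for j in range(1, len(pairs)):
        left = [v for _, v in pairs[:j]]
        right = [v for _, v in pairs[j:]]
        ml = sum(left) / len(left)
        mr = sum(right) / len(right)
        sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
        if best is None or sse < best[0]:
            cut = (pairs[j - 1][0] + pairs[j][0]) / 2.0
            best = (sse, cut, ml, mr)
    _, cut, ml, mr = best
    return lambda xi: ml if xi < cut else mr


# Shortcut from the reply: apparent R^2 of 0.9 on n = 4000 with p = 3n/4.
print(round(adjusted_r2(0.9, n=4000, p=3000), 3))  # about 0.60

# Simulated step function with noise (invented data, illustration only).
rng = random.Random(1)
x = [rng.random() for _ in range(100)]
y = [(1.0 if xi > 0.5 else 0.0) + rng.gauss(0.0, 0.1) for xi in x]
cv_r2 = repeated_cv_r2(x, y, fit_stump, k=10, repeats=5)
print(round(cv_r2, 3))
```

The important detail is that the whole fitting procedure, including any pruning or tuning, has to be rerun inside each fold; otherwise the cross-validated R^2 is optimistic. With n=4000 and a real tree fitter the same loop applies with k=10 and repeats=20 as suggested above.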