On Wed, Nov 19, 2008 at 2:19 AM, markus <[EMAIL PROTECTED]> wrote:
> Hi all at the R-teaching mailing list,
> I am currently preparing my first R-based regression course. Along this
> way I encountered the following problem:
> I want to simulate multivariate data that has some specific predefined
> attributes. For example I want to produce a Predictor-matrix (X)
> and a response-vector (y) that will yield a given vector of regression
> coefficients (b) and a given R2 when I perform a multivariate linear
> Regression on the dataset. This would be best described by the well
> known equation y=X*b+e.
> In some next step I also want to simulate polynomial relationships, but I
> think that should not work very differently.

Do you want to simulate data such that the least squares estimates of
the regression coefficients are exactly b and the R2 is exactly the
value you specify, or do you want to simulate data according to a model
for which the "true but unknown" regression coefficients are b and the
variance of the random noise is a particular value? The second scenario
is easier than the first, but both are possible.

To simulate from a "true" model X %*% beta + epsilon, where
Var(epsilon) = sigma^2 * diag(n), you simply add random noise to the
vector of true responses. Because the lm function in R can take a
matrix of responses (each column corresponding to a response vector),
it is best to simulate a matrix of y values as

    # assign r to be the number of replicates desired
    n <- nrow(X)
    # drop() turns the n x 1 product into a vector so that it is
    # recycled down each of the r columns of the noise matrix
    ymat <- drop(X %*% beta) + matrix(rnorm(n * r, sd = sigma), nrow = n)

If you want the first scenario, where you simulate data such that the
least squares estimates are exactly b (or as close to b as floating
point computation allows), then you should use the QR decomposition of
X. The Q matrix from the QR decomposition is an orthogonal matrix
corresponding to a rigid transformation of the response space. After
that transformation, the elements that determine the coefficients and
the elements that correspond to the noise form separate groups: for a
full-rank X with p columns, the first p elements of t(Q) %*% y
determine the coefficients and the remaining n - p carry the residuals.
In that basis you can set the required coefficients and a noise term of
exactly the desired length.

> I already searched the web and found some hints, but no clear answer.
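The QR construction described above can be sketched roughly as follows.
This is only a minimal sketch under stated assumptions: X has full
column rank and contains an intercept column (so that R2 is measured
about the mean), and 0 < R2 <= 1. The helper name simulate_exact and
the example values for n, b and R2 are hypothetical, chosen for
illustration only.

```r
# Hypothetical helper (not from the original post): build a response y
# whose least squares fit on X has coefficients b and a chosen R^2.
# Assumes X has full column rank and includes an intercept column.
simulate_exact <- function(X, b, R2) {
  n <- nrow(X)
  p <- ncol(X)
  stopifnot(length(b) == p, n > p, R2 > 0, R2 <= 1)
  qrX <- qr(X)
  stopifnot(qrX$rank == p)

  fitted <- drop(X %*% b)
  # Coordinates of the fitted values in the Q basis: the first p
  # elements determine the coefficients, the last n - p are zero.
  effects <- drop(qr.qty(qrX, fitted))

  # Regression sum of squares about the mean.
  ssr <- sum((fitted - mean(fitted))^2)
  # R2 = ssr / (ssr + rss)  =>  rss = ssr * (1 - R2) / R2
  rss <- ssr * (1 - R2) / R2

  # Random noise direction in the last n - p coordinates, rescaled so
  # that its squared length is exactly rss.
  w <- rnorm(n - p)
  effects[(p + 1):n] <- w * sqrt(rss / sum(w^2))

  # Rotate back to the original response space.
  drop(qr.qy(qrX, effects))
}

set.seed(1)
n  <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
X  <- cbind(1, x1, x2)   # design matrix with an intercept column
b  <- c(2, -1, 0.5)      # target coefficients (illustrative values)
y  <- simulate_exact(X, b, R2 = 0.7)

fit <- lm(y ~ x1 + x2)
coef(fit)                # should match b up to floating point error
summary(fit)$r.squared   # should match 0.7 up to floating point error
```

Because the model is linear in the coefficients, the same approach
should carry over to polynomial relationships by adding columns such as
x1^2 to X.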
> There is a pdf out there from John H. Walker (Teaching Regression with
> simulation) which does however not discuss this special topic. I also
> have a Paper from K.Baumann 'Chance Correlation in variable subset
> regression: Influence of the objective function, selection mechanism
> and Ensemble averaging' QCS, 2005. There an 'Autoregressive process'
> is used to simulate such data.
>
> Now my question is:
> Is it really that difficult to simulate such data? Is there perhaps a
> package in R facilitating at least parts of this work?
>
> Thanks in advance for the help,
> Markus
>
> _______________________________________________
> [email protected] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-sig-teaching
