Everyone, thanks for your help! I'm going to work on this more over the weekend.
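Just so I don't lose the thread of it, here is a bare-bones version of the recipe Achim and Jay describe below (simulate a design, build the model matrix, pick coefficients, add noise, then fit). The sample size, effect sizes, and the choice to zero out the interaction are placeholders I made up, not anything from their code or from the exams package.

  set.seed(2)
  n <- 120
  group <- factor(sample(c("ctrl", "trt"), n, replace = TRUE))
  block <- factor(sample(LETTERS[1:3], n, replace = TRUE))

  mm <- model.matrix(~ group * block)   # design with interaction
  b  <- rnorm(ncol(mm), sd = 3)         # arbitrary coefficients (my choice)
  b[grep(":", colnames(mm))] <- 0       # zero the interaction -> main effects only

  score <- drop(mm %*% b) + rnorm(n, sd = 2)
  toy <- data.frame(score, group, block)
  summary(aov(score ~ group * block, data = toy))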
SFK

On Wed, Mar 25, 2009 at 11:34:38PM +0100, Achim Zeileis wrote:
> From: Achim Zeileis <[email protected]>
> Date: Wed, 25 Mar 2009 23:34:38 +0100 (CET)
> To: Douglas Bates <[email protected]>
> Cc: "G. Jay Kerns" <[email protected]>,
>     "Scott F. Kiesling" <[email protected]>,
>     [email protected],
>     Bettina Gruen <[email protected]>
> Subject: Re: [R-sig-teaching] Creating data
>
> On Wed, 25 Mar 2009, Douglas Bates wrote:
>
>> On Wed, Mar 25, 2009 at 1:46 PM, G. Jay Kerns <[email protected]> wrote:
>>> Dear Scott,
>>>
>>> On Tue, Mar 24, 2009 at 4:04 PM, Scott F. Kiesling <[email protected]> wrote:
>>>> For a simple example, let's say I want to create a dataset with
>>>> students from different countries and academic departments who took an
>>>> English test. I want to build in some differences (significant and not)
>>>> and possibly even interactions among the scores by country and
>>>> department.
>>>
>>> I've been meaning to do this for a while (for quizzes, demos, etc.) but
>>> haven't yet had the chance. Below is something I quickly put together;
>>> I am sure it can be improved.
>>>
>>> You seemed to be looking for only one dependent variable, but in my
>>> case I am looking for several.
>>>
>>> Dep. vars: y1 - y4 (correlated)
>>> Fixed effects: dept (4 levels) and region (3 levels)
>>> Covariate: x_cov
>>>
>>> The model is Y = X*B + Eps:
>>>   X is the model matrix: ~ x_cov + region + dept + region:dept
>>>   B is the parameter matrix
>>>   Eps ~ multivariate normal (or t), mean 0, variance/covariance Sigma
>>>
>>> I used deviation contrasts; it would be easy to use treatment
>>> contrasts, too. If multicollinearity is desired, the same trick as for
>>> Eps can be used to generate multiple correlated x-variables.
>>>
>>> I didn't spend time thinking about the parameters - clearly the most
>>> important part (after all, there are 50+ of them in a bloated model
>>> like this). They are randomly generated, which means that they will
>>> display *radically* different behavior, even on different runs. In
>>> practice, one would want to choose a parameter matrix thoughtfully and
>>> stick with it.
>>>
>>> Everything is in BigData, at the end.
>>
>> Slightly off-topic, but you may want to look at the 'exams' package by
>> Achim Zeileis and Bettina Gruen
>>
>>   http://cran.r-project.org/web/packages/exams/
>>
>> The vignette for the package gives examples of incorporating the
>> data-set-generating process into the document that will generate the
>> exam questions. Even if you do not choose to use LaTeX to generate the
>> exam, it is still helpful to encapsulate the data generation process in
>> something beyond an R script with some comments. I know for myself
>> that, even with comments in a script, I soon forget exactly what is
>> being simulated in a script with a name like 'simulate.R'.
>
> Thanks for including us and advertising our package, Doug!
>
> Actually, we use our "exams" package to generate simple two-way ANOVA
> examples in our stats exams at WU Wien. The approach is very similar to
> the one previously suggested: generate some random regressors, compute
> the model matrix, draw random coefficients, set certain coefficients to
> zero (if only some main effects should be present), and then simulate
> the final data.
>
> For a one-way ANOVA there is an example in the package; see
>
>   library("exams")
>   exams("anova")
>
> and the underlying code in ~/inst/exercises/anova.Rnw.
>
> hth,
> Z
>
>>> # need pkgs corpcor, sfsmisc, mvtnorm
>>> set.seed(1)
>>> n <- 500  # sample size
>>>
>>> # generate indep. vars
>>> # covariate
>>> x_cov <- rnorm(n, mean = 37, sd = 15)
>>>
>>> # fixed effects
>>> region <- sample(c("Amer", "Euro", "Asia"), size = n, prob = c(7, 9, 4),
>>>                  replace = TRUE)
>>> dept <- sample(LETTERS[1:4], size = n, prob = c(2, 3, 3, 4), replace = TRUE)
>>> region <- factor(region)
>>> dept <- factor(dept)
>>> D <- data.frame(x_cov = x_cov, region = region, dept = dept)
>>> head(D)  # the data
>>>
>>> # model matrix
>>> X <- model.matrix(~ x_cov + (region + dept)^2,
>>>                   data = D,
>>>                   contrasts = list(region = "contr.sum", dept = "contr.sum"))
>>> head(X)  # the model matrix
>>>
>>> # number of parameters, for each dep. variable
>>> k <- dim(X)[2]
>>>
>>> # specify parameters
>>> # should think about these
>>> b1 <- rnorm(k, sd = 6)
>>> b2 <- rcauchy(k, scale = 9)
>>> b3 <- rnorm(k, sd = 12)
>>> b4 <- rcauchy(k, scale = 15)
>>> B <- matrix(c(b1, b2, b3, b4), ncol = 4)
>>> B  # the parameter matrix
>>>
>>> # generate error
>>> # a correlation matrix
>>> r1 <- c( 1.0,  0.4, -0.5,  0.7)
>>> r2 <- c( 0.4,  1.0,  0.2, -0.1)
>>> r3 <- c(-0.5,  0.2,  1.0,  0.0)
>>> r4 <- c( 0.7, -0.1,  0.0,  1.0)
>>> Corr <- matrix(c(r1, r2, r3, r4), ncol = 4)
>>> library(sfsmisc)
>>> Corr <- posdefify(Corr)
>>>
>>> # variances
>>> # tune Y correlation with the 1750
>>> eps_sigma <- 1750 * c(15, 2, 21, 5)
>>> library(corpcor)
>>> Sigma <- rebuild.cov(r = Corr, v = eps_sigma)
>>> library(mvtnorm)
>>> Eps <- rmvnorm(n, sigma = Sigma)
>>> # Eps <- rmvt(n, df = 2, sigma = Sigma)
>>>
>>> # put it all together
>>> Y <- data.frame(X %*% B + Eps)
>>> names(Y) <- c("y1", "y2", "y3", "y4")
>>> BigData <- cbind(Y, D)
>>> head(BigData)
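Two quick follow-ups I plan to try on Jay's script above; these are my own additions, not part of his code, and they assume the objects BigData, B, Corr, and n from his script are still in the workspace.

  # 1) Sanity check: fit the generating model to y1 and compare the
  #    recovered coefficients with column 1 of the parameter matrix B.
  fit1 <- lm(y1 ~ x_cov + (region + dept)^2, data = BigData,
             contrasts = list(region = "contr.sum", dept = "contr.sum"))
  anova(fit1)
  cbind(true = B[, 1], estimated = coef(fit1))

  # 2) Jay's "same trick as for Eps" for correlated covariates: draw several
  #    x-variables from a multivariate normal with a chosen correlation
  #    matrix (here I simply reuse his Corr, scaled to sd 15).
  library(mvtnorm)
  Xcor <- rmvnorm(n, mean = rep(37, 4), sigma = Corr * 15^2)
  colnames(Xcor) <- paste("x", 1:4, sep = "")
  round(cor(Xcor), 2)   # should be close to Corr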
>>> ***************************************************
>>> G. Jay Kerns, Ph.D.
>>> Associate Professor
>>> Department of Mathematics & Statistics
>>> Youngstown State University
>>> Youngstown, OH 44555-0002 USA
>>> Office: 1035 Cushwa Hall
>>> Phone: (330) 941-3310 Office (voice mail)
>>>        -3302 Department
>>>        -3170 FAX
>>> E-mail: [email protected]
>>> http://www.cc.ysu.edu/~gjkerns/

--
Scott F. Kiesling, PhD
Associate Professor
Department Chair
Department of Linguistics
University of Pittsburgh, 2816 CL
Pittsburgh, PA 15260
http://www.linguistics.pitt.edu
Office: +1 412-624-5916

_______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-teaching
