Suggestion below:

On Tue, Mar 13, 2012 at 1:24 PM, guillaume chaumet
<guillaumechau...@gmail.com> wrote:
> I forgot to mention that I have already tried to generate data based on
> the mean and sd of the two variables.
>
> x=rnorm(20,1,5)+1:20
> y=rnorm(20,1,7)+41:60
>
> simu<-function(x,y,n) {
>   simu=vector("list",length=n)
>   for(i in 1:n) {
>     x=c(x,rnorm(1,mean(x),sd(x)))
>     y=c(y,rnorm(1,mean(y),sd(y)))
>     simu[[i]]$x<-x
>     simu[[i]]$y<-y
>   }
>   return(simu)
> }
>
> test=simu(x,y,60)
> lapply(test, function(x) cor.test(x$x,x$y))
>
> As you can see, the correlation disappears as N increases.
> Perhaps a bootstrap with lm or cor.test could solve my problem.
In this case, you should consider creating the LARGEST sample first, and then
removing cases to create the smaller samples. The problem now is that you are
drawing a completely fresh sample every time, so you are seeing not only the
effect of sample size, but also the extra randomness introduced when case 1 is
replaced in every run. I am fairly confident (80%) that if you approach it my
way, the mystery you see will start to clarify itself. That is, draw the big
sample with the desired characteristics, and once you understand the sampling
distribution of cor for that big sample, you will also understand what happens
when each large sample is reduced by a few cases.

BTW, if you were doing this on a truly massive scale, my way would also run
much faster. You allocate memory once, and then you never need to delete rows
manually; you just trim the index of the rows you keep. (It is the same
data-access idea as the bootstrap.) A rough sketch of what I mean is below my
signature.

pj
--
Paul E. Johnson
Professor, Political Science        Assoc. Director
1541 Lilac Lane, Room 504           Center for Research Methods
University of Kansas                University of Kansas
http://pj.freefaculty.org           http://quant.ku.edu
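Untested, and only a minimal sketch of the "draw the largest sample once, then
trim the row index" idea: the seed, the target correlation rho, and the vector
of sample sizes below are placeholder values I made up, and building in the
correlation as rho*x + sqrt(1-rho^2)*noise is just one common construction;
your original rnorm-plus-trend setup would work just as well.

## Draw ONE big sample with a known correlation, then study cor() on nested
## subsets by trimming the row index, instead of drawing a fresh sample per N.
set.seed(42)                       # placeholder seed, for reproducibility only

n_max <- 200                       # size of the LARGEST sample (placeholder)
rho   <- 0.5                       # target correlation (placeholder)

x <- rnorm(n_max)
y <- rho * x + sqrt(1 - rho^2) * rnorm(n_max)   # cor(x, y) is roughly rho

big <- data.frame(x = x, y = y)

## Each smaller sample is just the first n rows of the same data frame,
## so only the sample size changes, never the underlying draw.
sizes <- c(20, 50, 100, 200)       # placeholder sequence of sample sizes
cors  <- sapply(sizes, function(n) cor(big$x[seq_len(n)], big$y[seq_len(n)]))
names(cors) <- sizes
print(cors)

Because every subset comes from the same rows, the differences among the
entries of cors reflect the effect of sample size alone, which is the thing
you are trying to isolate.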