You may generate a single standard normal random variable Z and set X = as.numeric(Z > x), the indicator of Z > x. A back-of-the-envelope calculation (Cov(Z, X) = E[Z * (Z > x)] = dnorm(x), while Var(X) = pnorm(x) * (1 - pnorm(x))) shows that the two have correlation

    exp(-x^2/2) / sqrt(2*pi*pnorm(x)*(1-pnorm(x)))

which decreases from a maximum of sqrt(2/pi), about 0.798, at x = 0 towards 0 as x goes to infinity.
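To see which correlations are attainable at all with this construction, one can plot the formula as a function of the threshold (a minimal sketch; rho is just a throwaway name for the expression above):

    rho <- function(x) exp(-x^2/2) / sqrt(2*pi*pnorm(x)*(1-pnorm(x)))
    curve(rho, from = 0, to = 4,
          xlab = "threshold x", ylab = "cor(Z, Z > x)")  # peaks at sqrt(2/pi) at x = 0
    abline(h = 0.7, lty = 2)  # the target correlation used below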
If you aim at a correlation of 0.7, here is how to proceed:

> rt <- uniroot(function(x) exp(-x^2/2)/sqrt(2*pi*pnorm(x)*(1-pnorm(x)))-0.7,
+               lower=0, upper=3)
> rt$root
[1] 0.841188
> x <- rnorm(1000)
> y <- as.numeric(x > rt$root)
> cor(x,y)
[1] 0.7069753

Best,
Giovanni

> Date: Sun, 17 Apr 2005 11:52:49 +0100 (BST)
> From: [EMAIL PROTECTED] (Ted Harding)
>
> On 16-Apr-05 Ashraf Chaudhary wrote:
> > Ted:
> > Thank you for your help. All I want is a binomial random
> > variable that is correlated with a normal random variable,
> > with a specified correlation. By linear I mean the ordinary
> > Pearson correlation. I tried the following two methods;
> > in each case the resulting correlation is substantially
> > less than the one specified.
> >
> > Method I: Generate two correlated normals (using the Cholesky
> > decomposition method) and dichotomize one (0/1) to get the
> > binomial.
> >
> > Method II: Generate two correlated variables, one binomial
> > and one normal, using the Cholesky decomposition method.
> >
> > Here is how I did it:
> >
> > X <- rnorm(100)
> > Y <- rnorm(100)
> > r <- 0.7
> > Y1 <- X*r + Y*sqrt(1-r^2)
> > cor(X, Y1)        # correlated normals via the Cholesky construction
> > cor(X > 0.84, Y1) # Method I
> >
> > X1 <- rbinom(100, 1, 0.5)
> > Y2 <- X1*r + Y*sqrt(1-r^2)
> > cor(X1, Y2)       # Method II
>
> Hello Ashraf,
>
> The above more explicit explanation certainly helps to clarify
> your question! (I echo Bill's counsel to spell out essential
> detail.)
>
> I'll lead on from your second method above, which makes it
> explicit that to each of your 100 values Y sampled with rnorm
> you attach one of 100 binomial (n = 1) variables X1 sampled
> from {0,1}. However, in this second method you also create the
> derived variable Y2 = X1*r + Y*sqrt(1-r^2), and apparently you
> want this Y2 to be the Normal variate that is correlated
> with X1.
>
> Unfortunately, given the way Y2 is constructed, it does not
> have a Normal distribution. This is almost obvious from the
> fact that each value of Y2 is drawn from N(0, 1-r^2) or from
> N(r, 1-r^2), according to whether the corresponding X1 is 0
> or 1, so that unconditionally Y2 is a mixture of the two.
>
> You can explore this in R by looking at the histogram of a
> much larger sample of Y2. Thus:
>
> r <- 0.7
> Y <- rnorm(100000)
> X1 <- rbinom(100000, 1, 0.5)
> Y2 <- X1*r + Y*sqrt(1-r^2)
> m <- mean(Y2); s <- sd(Y2)
> hist(Y2, breaks = 0.1*(-45:45))
> lines(0.1*(-45:45), 0.1*100000*dnorm(0.1*(-45:45), m, s))
>
> Do it once and the fit looks pretty good. However, if you
> repeat the above commands several times, you will observe
> that near the peak of the curve there is a definite tendency
> for the peak of the histogram to lie below the peak of the
> fitted curve. This illustrates the fact that you are looking
> at a superposition of two Normal curves with identical
> proportions (since you have p = 0.5 in the Binomial sample)
> and identical variances, but shifted relative to each other
> by an amount r, which is less than 1 in magnitude. This
> superposition is flatter at the top than any single Normal
> curve would be, which is reflected in the "deficiency" of
> the histogram relative to the fitted curve.
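The flattened peak can also be seen without simulation, by plotting the exact density of Y2 against a Normal with the same mean and variance (a minimal sketch; the moment-matched comparison curve is an illustrative choice, standing in for the fitted curve above):

    r <- 0.7
    s <- sqrt(1 - r^2)                  # sd of each mixture component
    x <- seq(-3, 4, by = 0.01)
    mix <- 0.5*dnorm(x, 0, s) + 0.5*dnorm(x, r, s)       # exact density of Y2 when p = 0.5
    fit <- dnorm(x, mean = r/2, sd = sqrt(s^2 + r^2/4))  # Normal with Y2's true mean and variance
    plot(x, mix, type = "l", ylab = "density")
    lines(x, fit, lty = 2)  # the moment-matched Normal peaks slightly above the flatter mixture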
> You can also see this theoretically. Your Y2 is
>
>   Y2 = A*X1 + B*Y
>
> so that
>
>   P[Y2 <= y] = (1-p)*P[B*Y <= y] + p*P[B*Y <= y - A]
>              = (1-p)*P[Y <= y/B] + p*P[Y <= (y-A)/B]
>
> where p is the Binomial P[X1 = 1]. Hence the density function
> of Y2 is the derivative of this with respect to y, namely
>
>   (1-p)*f(y/B)/B + p*f((y-A)/B)/B
>
> where f(x) is the density function of the standard N(0,1).
>
> While in the case of your example (see the histograms) the
> distribution of Y2 might be close enough to Normal for
> practical purposes (that depends on what your practical
> purpose is), it is nonetheless better to avoid the
> theoretical difficulty if possible.
>
> So I'd suggest experimenting along the following lines.
>
> 1. Let X1 be a sample of size N from rbinom(N, 1, p)
>    (where, in general, p need not be 0.5).
>
> 2. Let Y be a sample of size N from rnorm(N, mu, sigma)
>    (and again, in general, mu need not be 0 nor sigma 1).
>
>    This is as in your example. So far, you have a true
>    Binomial variate X1 and a true Normal variate Y.
>
> 3. X1 will have some number n1 of 1s in it, and (N - n1) 0s.
>    Now set up a correspondence between the 1s and 0s in X1
>    and the values in Y, in such a way that the 1s tend to be
>    allocated to the higher values of Y and the 0s to the
>    lower values. The resulting set of pairs (X1, Y) is then
>    your correlated sample.
>
> In other words, permute the sample X1 and cbind the result
> to Y.
>
> This is similar to your "dichotomy" method (and would be
> almost identical if you simply allocated the n1 1s to the n1
> largest Ys), but it is more flexible, since I am only saying
> "tend": consider sampling from the N Binomial 0s and 1s
> according to a non-uniform probability distribution.
>
> The theoretical benefit of doing it this way (provided you
> can arrange the sampling in step 3 so as to get your desired
> correlation) is that, by construction, the marginal
> distributions of X1 and Y are exactly as they were to start
> with, namely Binomial and Normal respectively.
>
> Here is an example of a possible approach:
>
> X0 <- rbinom(100, 1, 0.5)
> Y <- rnorm(100); Y <- sort(Y)
> p0 <- ((1:100) - 0.5)/100; p0 <- p0/sum(p0); p <- cumsum(p0)
> n1 <- sum(X0); ix <- sample(1:100, n1, replace=FALSE, prob=p)
> X1 <- numeric(100); X1[ix] <- 1; sum(X1)
> ## [1] 50
> cor(X1, Y)
> ## [1] 0.6150937
>
> This shows that it is quite easy to get close to your
> (example) correlation of 0.7 by simple means. (Note that X1,
> as constructed above, is equivalent to the original Binomial
> sample X0, since it has the same number of 0s and 1s; it is
> just a matter of rearranging these so that the 1s tend to
> line up with the higher values of Y.)
>
> To refine the method along these lines, play with the way
> that p is derived from p0 (the cumsum line above). You might
> also look at some of the suggestions in Bill Venables'
> (private) mail.
>
> Even something as banal as
>
> X1 <- sort(rbinom(100, 1, 0.5))
> Y <- sort(rnorm(100))
> cor(X1, Y)
> ## [1] 0.768373
>
> gives an interesting result, though this is far too
> inflexible for general purposes.
>
> There may even be a theoretical way of arranging the 1s so
> as to get the desired correlation exactly on average -- my
> nose indicates that this may be so, but I have not followed
> it up!
>
> Best wishes,
> Ted.
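One way to make the "tend" in step 3 tunable (a minimal sketch; the function name correlated_binom, the exponential weighting, and the sharpness parameter a are illustrative choices, not a canonical recipe):

    ## Allocate the 1s of a Bernoulli(p) sample to high values of Y,
    ## with a controlling how strongly the 1s favour large Y;
    ## a = 0 gives a uniform (roughly uncorrelated) allocation.
    correlated_binom <- function(N = 100, p = 0.5, a = 3) {
      Y  <- sort(rnorm(N))
      n1 <- sum(rbinom(N, 1, p))        # number of 1s to place
      w  <- exp(a * ((1:N) - 0.5)/N)    # weights increasing with the rank of Y
      ix <- sample(1:N, n1, prob = w)   # weighted sampling without replacement
      X1 <- numeric(N); X1[ix] <- 1
      cor(X1, Y)                        # tune a until this is near the target
    }
    correlated_binom(N = 10000, a = 3)

As in the construction above, the marginals stay exactly Binomial and Normal; only the pairing between the two samples changes.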