Re: Max R-sq for Binary Data

Jan de Leeuw Mon, 7 Feb 2000 09:02:39 -0800
This is more slowly, and more clearly as well. I hate to be the one to break
the news, but some time ago Man invented Mathematics to make these descriptions
even clearer.

It seems to me you are saying

prob(Y=1|Z) = .2(Z+2.5)

where Z is continuously distributed between -2.5 and +2.5 (either as
a truncated normal or as a uniform or as whatever). You are asking about
R^2 in this situation (and perhaps about the maximum and minimum of
R^2 for different distributions of Z).

Thus, for the joint,

prob(Y=1,Z)=(1/2+Z/5)p(Z)
prob(Y=0,Z)=(1/2-Z/5)p(Z)

which means

prob(Y=1)=(1/2+M/5)
prob(Y=0)=(1/2-M/5)

where M=E(Z). Thus

E(Y) = (1/2+M/5)
V(Y) = (1/2+M/5)(1/2-M/5)
E(ZY)= M/2 + (S^2+M^2)/5

where S^2 = V(Z). Thus

C(ZY) = M/2 + (S^2+M^2)/5 - M(1/2+M/5)=S^2/5

which means

R(ZY) = S / (5*SQRT((1/2+M/5)(1/2-M/5))).

I usually make mistakes in these calculations, so this may very well
be wrong, but at least it is an answer. If M=0, then R(ZY)= 2S/5. For
a uniform on (-2.5,+2.5) this gives R(ZY)=2/sqrt(12)=.577. It is maximal
for a two-point distribution on -2.5 and +2.5, where R(ZY)=1, and it can
be made arbitrarily small by concentrating Z near 0.
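
The two M=0 cases can be checked by simulation. What follows is only a
rough sketch, not part of the derivation: it draws Z, draws Y as a
Bernoulli with prob(Y=1|Z) = .2(Z+2.5), and compares the sample
correlation with S / (5*sqrt((1/2+M/5)(1/2-M/5))). The function name
simulate_r, the sample size and the seed are arbitrary choices made for
this illustration.

```python
import math
import random

def simulate_r(draw_z, n=100_000, seed=0):
    """Return (sample correlation of Z and Y, value of the formula).

    draw_z is a function taking a random.Random and returning one Z
    in [-2.5, 2.5]. Y is Bernoulli with prob(Y=1|Z) = 0.2*(Z+2.5).
    """
    rng = random.Random(seed)
    zs = [draw_z(rng) for _ in range(n)]
    ys = [1.0 if rng.random() < 0.2 * (z + 2.5) else 0.0 for z in zs]

    mz = sum(zs) / n                      # sample M = E(Z)
    my = sum(ys) / n                      # sample E(Y)
    vz = sum((z - mz) ** 2 for z in zs) / n   # sample S^2 = V(Z)
    vy = sum((y - my) ** 2 for y in ys) / n   # sample V(Y)
    czy = sum((z - mz) * (y - my) for z, y in zip(zs, ys)) / n

    sample_r = czy / math.sqrt(vz * vy)
    # Formula from the post, evaluated at the sample M and S.
    p1 = 0.5 + mz / 5
    formula_r = math.sqrt(vz) / (5 * math.sqrt(p1 * (1 - p1)))
    return sample_r, formula_r

# Z uniform on [-2.5, 2.5]; the post gives R = 2/sqrt(12).
r_uniform, f_uniform = simulate_r(lambda rng: rng.uniform(-2.5, 2.5))
# Z on the two points -2.5 and +2.5; the post gives R = 1.
r_two, f_two = simulate_r(lambda rng: rng.choice([-2.5, 2.5]))

print("uniform  :", round(r_uniform, 3), round(f_uniform, 3))
print("two-point:", round(r_two, 3), round(f_two, 3))
```

In the two-point case Y is a deterministic function of Z, so the
correlation is 1 up to rounding; in the uniform case the simulated value
only approximates 2/sqrt(12) because of sampling noise.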

It is easy to calculate for a truncated normal, but somebody else can do that.
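
For what it is worth, here is a hedged sketch of that calculation. It
assumes Z is a standard normal truncated to (-2.5,+2.5), which is how I
read the question; a different scale for the normal would change S. For
symmetric truncation at -a and +a the variance is
1 - 2*a*phi(a)/(2*Phi(a)-1), and M=0, so R(ZY) = 2S/5 as above. The name
truncated_normal_r is made up for this example.

```python
import math
from statistics import NormalDist

A = 2.5  # truncation point; the model prob(Y=1|Z) = .2(Z+2.5) needs |Z| <= 2.5

def truncated_normal_r(a=A):
    """R(ZY) for Z ~ standard normal truncated to (-a, a), with M = 0."""
    nd = NormalDist()                      # standard normal, mean 0, sd 1
    mass = nd.cdf(a) - nd.cdf(-a)          # probability kept by truncation
    var = 1.0 - 2.0 * a * nd.pdf(a) / mass # variance of the truncated normal
    s = math.sqrt(var)
    return 2.0 * s / 5.0                   # R(ZY) = 2S/5 when M = 0

r_tn = truncated_normal_r()
print("R   =", round(r_tn, 4))
print("R^2 =", round(r_tn ** 2, 4))
```

Under these assumptions R comes out at about .38 (R^2 about .15), which
is below the uniform value of .577 because the truncated normal puts
more of its mass near 0.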

At 5:07 AM +0000 2/7/00, Milo Schield wrote:
>Rich Ulrich wrote in message ...
>  > ...could you explain the question more slowly?
>
>
>Sorry for not giving enough motivation or context.   Yes, I'm not comparing
>the fit for continuous data with the fit for its binary equivalent.
>
>I have been thinking about the small R-squared values for logistic
>regressions involving z (the standard normal) on the horizontal axis.   I
>realize that the binary nature of the data precludes good fits, but that per
>se does not mean that a given regression model can't bring about a
>substantial improvement in fit -- when compared with the fit using just the
>mean as the model.  When data is linearly distributed on the horizontal
>axis, even though the data is binary on the vertical axis, a simple Ordinary
>Least Squares model can bring about a great improvement in the standard
>deviation when comparing the explained [original - unexplained] with the
>original.
>
>I don't have any idea whether a normal distribution of data on the
>horizontal axis would make it very much more difficult to get any
>improvement in modeling binary data on the vertical axis -- due to the
>non-linear weighting on the horizontal axis.  That is what I want to learn.
>
>To simplify matters, I wanted to consider a linear model of the expected
>value of Y with values ranging from 0 to 1.    I'll entertain more complex
>models later.
>
>As a basis for comparison, I began with data linearly distributed on the
>horizontal axis.  In the horizontal linear case, I let X range from 0 to 1
>and let Y_continuous = X.    Y_discrete was assigned so that the fraction of
>ones in an adjacent group of points had a value similar to that of the
>continuous model (Y_c) in that region of X.
>
>In the horizontal normal case, I let Z range from -2.5 to +2.5 and I let
>Y_continuous = .2*(Z + 2.5) so that Y_continuous ranged from 0 to 1 over
>this interval.   I then assigned values to Y_discrete so that the fraction
>of ones in an adjacent group of points had a value similar to that for the
>continuous model (Y_c) in that region of Z.     I presume that the R-squared
>will vary with the range of Z values involved, since a Z range from -oo to +oo
>might involve some real challenges in fitting a linear distribution over
>that interval.
>
>I presume that there is a theoretical/analytical solution for the value of
>R-squared given an infinite number of points in the intervals on the X axis.
>I expect that my crude discrete simulations give some indication of these
>values.  My question is this:
>
>"What is the limiting value of R-squared as the number of discrete points in
>the X-axis interval approaches infinity WHEN
>1. the data is normally distributed on the X axis [say -2.5 < Z < +2.5] and
>2. the binary data values on the Y axis model a linear distribution going
>from 0 to 1 over the interval on the X axis?"
>
>
>===========================================================================
>   This list is open to everyone. Occasionally, people lacking respect
>   for other members of the list send messages that are inappropriate
>   or unrelated to the list's discussion topics. Please just delete the
>   offensive email.
>
>   For information concerning the list, please see the following web page:
>   http://jse.stat.ncsu.edu/
>===========================================================================

===
Jan de Leeuw; Professor and Chair, UCLA Department of Statistics;
US mail: 8142 Math Sciences Bldg, Box 951554, Los Angeles, CA 90095-1554
phone (310)-825-9550;  fax (310)-206-5658;  email: [EMAIL PROTECTED]
    http://www.stat.ucla.edu/~deleeuw and http://home1.gte.net/datamine/
============================================================================
          No matter where you go, there you are. --- Buckaroo Banzai
============================================================================

