Re: Max R-sq for Binary Data

Milo Schield Sun, 6 Feb 2000 21:42:33 -0800
Rich Ulrich wrote in message ...
> ...could you explain the question more slowly?


Sorry for not giving enough motivation or context.   Yes, I'm not comparing
the fit for continuous data with the fit for its binary equivalent.

I have been thinking about the small R-squared values for logistic
regressions involving z (the standard normal) on the horizontal axis.   I
realize that the binary nature of the data precludes good fits, but that per
se does not mean that a given regression model can't bring about a
substantial improvement in fit -- when compared with the fit using just the
mean as the model.  When data is linearly distributed on the horizontal
axis, even though the data is binary on the vertical axis, a simple Ordinary
Least Squares model can bring about a great improvement in the standard
deviation when comparing the explained [original - unexplained] with the
original.

I don't have any idea whether a normal distribution of data on the
horizontal axis would make it very much more difficult to get any
improvement in modeling binary data on the vertical axis -- due to the
non-linear weighting on the horizontal axis.  That is what I want to learn.

To simplify matters, I wanted to consider a linear model of the expected
value of Y with values ranging from 0 to 1.    I'll entertain more complex
models later.

As a basis for comparison, I began with data linearly distributed on the
horizontal axis.  In the horizontal linear case, I let X range from 0 to 1
and let Y_continuous = X.    Y_discrete was assigned so that the fraction of
ones in an adjacent group of points had a value similar to that of the
continuous model (Y_c) in that region of X.

In the horizontal normal case, I let Z range from -2.5 to +2.5 and I let
Y_continuous = .2*(Z + 2.5) so that Y_continuous ranged from 0 to 1 over
this interval.   I then assigned values to Y_discrete so that the fraction
of ones in an adjacent group of points had a value similar to that for the
continuous model (Y_c) in that region of Z.     I presume that the R-squared
will vary with the range of Z values involved, since a Z range from -oo to +oo
might involve some real challenges in fitting a linear distribution over
that interval.
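
The two setups above can be imitated with a short simulation. This is only a
rough sketch under assumptions not in the description above: Y_discrete is
drawn as an independent 0/1 value with probability Y_continuous (rather than
assigned by hand within groups of adjacent points), "normally distributed"
is taken to mean standard normal draws kept only inside [-2.5, +2.5], and the
helper name r_squared_binary is made up for illustration.

```python
import numpy as np

def r_squared_binary(x, p, rng):
    """Draw binary Y with P(Y=1 | x) = p, fit OLS of Y on x, return R-squared."""
    # Assumption: independent Bernoulli draws stand in for the hand assignment.
    y = (rng.random(x.size) < p).astype(float)
    slope, intercept = np.polyfit(x, y, 1)      # simple least-squares line
    resid = y - (slope * x + intercept)
    return 1.0 - resid.var() / y.var()          # 1 - unexplained / original

rng = np.random.default_rng(0)
n = 200_000

# Horizontal linear case: X spread evenly over [0, 1], Y_continuous = X.
x = rng.random(n)
r2_uniform = r_squared_binary(x, x, rng)

# Horizontal normal case: Z standard normal, restricted to [-2.5, +2.5],
# Y_continuous = .2*(Z + 2.5).
z = rng.standard_normal(n)
z = z[np.abs(z) <= 2.5]
r2_normal = r_squared_binary(z, 0.2 * (z + 2.5), rng)

print(f"R-squared, linear case: {r2_uniform:.3f}")
print(f"R-squared, normal case: {r2_normal:.3f}")
```

Changing the cutoff 2.5 (and the matching slope) shows how the R-squared
depends on the range of Z values involved.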

I presume that there is a theoretical/analytical solution for the value of
R-squared given an infinite number of points in the intervals on the X axis.
I expect that my crude discrete simulations give some indication of these
values.  My question is this:

"What is the limiting value of R-squared as the number of discrete points in
the X-axis interval approaches infinity WHEN
1. the data is normally distributed on the X axis [say -2.5 < Z < +2.5] and
2. the binary data values on the Y axis model a linear distribution going
from 0 to 1 over the interval on the X axis?"
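
One possible way to approach this limit, offered only as a sketch: assume each
Y is an independent 0/1 draw with P(Y=1 | X=x) = p(x), where p is the linear
model, so that in the limit the fitted line is p itself and R-squared is the
explained variance over the total variance. The symbols p, m, a, \phi (standard
normal density) and \Phi (standard normal CDF) are introduced here for
illustration and are not part of the question as posed.

```latex
R^2 \;\longrightarrow\; \frac{\operatorname{Var}[p(X)]}{\operatorname{Var}[Y]}
    = \frac{\operatorname{Var}[p(X)]}{m(1-m)},
    \qquad m = \operatorname{E}[p(X)].

% Horizontal linear case: X uniform on [0,1], p(x) = x
m = \tfrac{1}{2}, \quad \operatorname{Var}[p(X)] = \tfrac{1}{12}, \quad
R^2 \to \frac{1/12}{1/4} = \frac{1}{3}.

% Horizontal normal case: Z standard normal truncated to [-a, a], a = 2.5,
% p(z) = 0.2\,(z + 2.5)
\operatorname{Var}[Z] = 1 - \frac{2a\,\phi(a)}{2\Phi(a) - 1} \approx 0.911,
\qquad
R^2 \to \frac{0.04 \times 0.911}{0.25} \approx 0.146.
```

Under these assumptions, widening the interval while rescaling the line to
still run from 0 to 1 shrinks the slope faster than the variance of Z grows,
so the limiting R-squared would fall as the range of Z increases.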






===========================================================================
  This list is open to everyone. Occasionally, people lacking respect
  for other members of the list send messages that are inappropriate
  or unrelated to the list's discussion topics. Please just delete the
  offensive email.

  For information concerning the list, please see the following web page:
  http://jse.stat.ncsu.edu/
===========================================================================