Like most of the others, I am not sure what you mean (how the data are
generated). But a regression with a binary Y is very much like a
discriminant analysis, or like a t-test if there is a single X. The
R-sq is the ratio of between-group variance to total variance (of X,
in the single-X case). So we can ask what the maximum of this ratio
is over all partitionings into two groups. This is the criterion of
Ward's cluster analysis method, and it is known that the optimum
(as in the figure below) takes the groups to be two disjoint X
intervals. I.e., try all n-1 cut-points between successive sorted
x(i). If there is more than one X, then Hartigan's theorem says that
the two groups must be separable by a hyperplane. Finding the
hyperplane is equivalent to doing a k-means cluster analysis with two
clusters. These procedures will give you the maximum R-sq for a given
X over all possible binary y.
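
For the single-X case the cut-point search can be sketched as follows.
This is only an illustration, not anything from the original question:
the function name max_rsq_binary_y is made up here, and the code
assumes the R-sq of interest is the between-group sum of squares of X
divided by its total sum of squares, with y = 0 below the cut and
y = 1 above it.

```python
import numpy as np

def max_rsq_binary_y(x):
    """Maximum R-sq over all binary y for one X, by trying every cut-point.

    Hypothetical helper for illustration. Returns (best_rsq, k), where k
    is the number of sorted x values placed in the lower (y = 0) group.
    """
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    grand_mean = x.mean()
    total_ss = np.sum((x - grand_mean) ** 2)
    if n < 2 or total_ss == 0.0:
        raise ValueError("need at least two points and some variation in x")

    k = np.arange(1, n)                 # sizes of the lower group: 1 .. n-1
    lower_sum = np.cumsum(x)[:-1]       # sum of the k smallest x values
    lower_mean = lower_sum / k
    upper_mean = (x.sum() - lower_sum) / (n - k)

    # Between-group sum of squares for each of the n-1 cut-points.
    between_ss = (k * (lower_mean - grand_mean) ** 2
                  + (n - k) * (upper_mean - grand_mean) ** 2)
    rsq = between_ss / total_ss

    best = int(np.argmax(rsq))
    return float(rsq[best]), int(k[best])

if __name__ == "__main__":
    # 100 equally spaced points on [0, 1], as in the example quoted below.
    best_rsq, k = max_rsq_binary_y(np.linspace(0.0, 1.0, 100))
    print(best_rsq, k)
```

With 100 equally spaced points this puts the cut in the middle (50
zeros, then 50 ones) and gives an R-sq of about 0.75.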



At 6:15 PM -0500 2/6/00, Donald F. Burrill wrote:
>On Sun, 6 Feb 2000, Milo Schield wrote:
>
>  > QUESTION:  What is the theoretical maximum value of R-sq ** when binary
>  > data (Y) is obtained from a simple linear model?
>
>Not clear what "obtained from a simple linear model" means.  Are you
>using a model to _generate_ values of Y?  Or are you using such a model
>to _represent_ a relationship between X and Y in "real" data?
>       But, for openers, if you're looking at data that are binary in Y
>and continuous in X, I would expect max R-sq to depend on (a) the
>proportion of Y's that are at one value (or the other;  symmetric about
>0.5) and (b) the degree of separation between values of X for one value
>of Y and values of X for the other value of Y.
>       (I picture something like the following, WLOG choosing Y = {0, 1}
>for convenience:
>
>       Y=1 |                            * * *  *** * *  *
>           |
>           |
>       Y=0 | * *  *** * **  *
>           |
>           +-----------------------------------------------
>                               X
>
>The larger the horizontal gap between max (X: Y=0) and min (X: Y=1), the
>greater the value of R-sq, ceteris paribus.  Since you have specified
>that you want the _maximum_ R-sq, we can rule out the situations in which
>the values of X overlap between the groups defined by Y.  You have not,
>however, specified how the conditional distributions of X are to differ
>from each other.)
>
>  > The data is binary with Y values taken from a linear model going from 0
>  > to 1 over the range of X.
>
>Can you specify what model you have in mind, and how Y values are "taken"
>from it?
>
>  > The binary sequences of Y values are organized to minimize* the
>  > standard deviation around the model.
>  >
>  > TYPE REGRESSION            DISTRIBUTION OF X VALUES
>  > a.     OLS                   linear
>  > b.     OLS                   normal [width truncated at 6 sigma?]
>  > c.     Logistic              linear
>  > d.     Logistic              normal [width truncated at 6 sigma?]
>
>       O.K., stop a minute.  I think I know what a normal distribution
>is, arbitrarily truncated or not;  but what is a "linear" distribution?
>Do you perchance mean "uniform" or "rectangular"?  Over what range?
>
>  > Based on some discrete trials, I get the following estimates for R-sq:
>  > a.  99%
>  > b.  16%
>  > c.  96%
>  > d.  16%
>
>I don't see how this can make sense without rather more detail (agreeing,
>as I often do, with Rich Ulrich).  "Some discrete trials"??  (Not,
>doubtless, to be confused with indiscreet trials, nor presumably with
>continuous trials;  but it's still a puzzle what these trials might be.)
>
>  > * On the selection of binary Y values.  Suppose the X values are linearly
>  > distributed from 0 to 1 and the Model is Y=X.  In the discrete case
>  > with 100 points, the first 5 would be all zeros and the last 5 would
>  > be all ones.
>
>Umm...  I'm sorry, but I don't see why, with 100 points (presumably
>equally spaced?  Is that part of what you meant by "linearly distributed"?),
>the first 50 (not just 5) wouldn't be all zeroes and the last 50 all ones.
>
>  > At the center, half the points would be zeroes and the other half would
>  > be ones.
>
>At the center of what?  If you mean at the center of the 100 points, that
>would presumably comprise 2 points (100 being even;  if it were 101
>points, the center would comprise one point).  What are all these
>"points" (and how many of them are there?) half of which would be zeroes
>etc.?  Are we to understand that the half that are zero are randomly zero
>in some sense of "random"?  Why?
>
>  ------------------------------------------------------------------------
>  Donald F. Burrill                                 [EMAIL PROTECTED]
>  348 Hyde Hall, Plymouth State College,          [EMAIL PROTECTED]
>  MSC #29, Plymouth, NH 03264                                 603-535-2597
>  184 Nashua Road, Bedford, NH 03110                          603-471-7128
>
>
>
>===========================================================================
>   This list is open to everyone. Occasionally, people lacking respect
>   for other members of the list send messages that are inappropriate
>   or unrelated to the list's discussion topics. Please just delete the
>   offensive email.
>
>   For information concerning the list, please see the following web page:
>   http://jse.stat.ncsu.edu/
>===========================================================================

===
Jan de Leeuw; Professor and Chair, UCLA Department of Statistics;
US mail: 8142 Math Sciences Bldg, Box 951554, Los Angeles, CA 90095-1554
phone (310)-825-9550;  fax (310)-206-5658;  email: [EMAIL PROTECTED]
    http://www.stat.ucla.edu/~deleeuw and http://home1.gte.net/datamine/
============================================================================
          No matter where you go, there you are. --- Buckaroo Banzai
============================================================================
