Markus Quandt wrote:
> Well, I said that I wanted an intuitive explanation. But at the same time, I
> need an idea where and why the estimation procedure goes wrong, so pointing at
> some specific formulas/terms might be very helpful. I have tried to restate my
> exact problems in my reply to Bob Hayden (next thread, same heading). In
> particular, one of the problems I have with explanations like the above is why
> the wide scatter of residuals and the reflection of those in the estimated
> parameters is considered _undue_.

    Suppose that there are few data in the widely-scattered region.
Sampling variation may well put most of them at one extreme or the
other, and the regression diagnostics will not indicate any cause for
concern (apart from the possible indication, by the residual plot, of
heteroscedasticity; this is your only warning, if you get any!).


    Here's a simulation in MINITAB:


MTB > set c1                 #X
DATA> 1 1 1 1 5 5 9 9 9 9
DATA> end                    #a reasonable small-sample experimental design
MTB > rand 10 c2             #N(0,1)
MTB > let c3 = c2*c1         #model m=b=0, error SD proportional to X
MTB > plot c3*c1

    The output of a typical plot:

Plot


 C3      -    2                         *
         -    *
         -
      0.0+                                                         *
         -    *                                                    *
         -
         -
         -
     -2.5+
         -
         -
         -                              *                          *
         -
     -5.0+
         -
         -                                                         *
         -
           ------+---------+---------+---------+---------+---------+C1
               1.5       3.0       4.5       6.0       7.5       9.0


The regression for the same repetition:

MTB > regress c3 1 c1.

Regression Analysis


The regression equation is
C3 = 1.12 - 0.417 C1

Predictor        Coef       StDev          T        P
Constant        1.116       1.241       0.90    0.395
C1            -0.4172      0.2019      -2.07    0.073

S = 2.284       R-Sq = 34.8%     R-Sq(adj) = 26.7%

Analysis of Variance

Source            DF          SS          MS         F        P
Regression         1      22.276      22.276      4.27    0.073
Residual Error     8      41.726       5.216
Total              9      64.002



And here is a histogram of the p-values for the test of (m=0), on 20
repetitions; this is clearly far from U[0,1].

Histogram of C6   N = 20

Midpoint        Count
     0.0            1  *
     0.1            3  ***
     0.2            4  ****
     0.3            4  ****
     0.4            2  **
     0.5            1  *
     0.6            0
     0.7            0
     0.8            0
     0.9            3  ***
     1.0            2  **
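
Twenty repetitions give only a rough picture. As a hedged sketch (not part of the original MINITAB session), the same experiment can be repeated many times in plain Python to estimate how often the nominal 5% test of (m=0) rejects. The function names, the seed, and the number of repetitions are illustrative assumptions; 2.306 is the standard two-sided 5% critical value of t with 8 degrees of freedom. If the usual test were valid here, the printed rate would be close to 0.05.

```python
import math
import random

X = [1, 1, 1, 1, 5, 5, 9, 9, 9, 9]   # same design as the MINITAB session
T_CRIT_8DF = 2.306                   # two-sided 5% critical value, t with 8 df


def slope_t(x, y):
    """Ordinary least-squares t statistic for the slope of y on x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b = sxy / sxx                    # fitted slope
    a = ybar - b * xbar              # fitted intercept
    rss = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    s2 = rss / (n - 2)               # pooled residual variance
    return b / math.sqrt(s2 / sxx)


def rejection_rate(reps, seed=0):
    """Fraction of repetitions in which |t| exceeds the 5% critical value."""
    rng = random.Random(seed)        # seed is an arbitrary choice
    hits = 0
    for _ in range(reps):
        # True model: slope = intercept = 0, error SD equal to X.
        y = [xi * rng.gauss(0.0, 1.0) for xi in X]
        if abs(slope_t(X, y)) > T_CRIT_8DF:
            hits += 1
    return hits / reps


if __name__ == "__main__":
    print(rejection_rate(20000))
```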

    Much of the time the (Y|X=9) values have a mean well away from 0;
and the regression formulae, which pool variance, make this appear to be
much more significant than it truly is. (With many more data the inflated
statistical significance will still be there, but the apparent practical
significance will go away.)
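
One way to put numbers on the pooling effect, offered as a sketch rather than as part of the original argument: for the design above, with error SD assumed equal to X, compare the actual sampling variance of the fitted slope with the value the usual formula s^2/Sxx reports on average. The function name is hypothetical; the leverage formula h_i = 1/n + (x_i - xbar)^2/Sxx and the identity E[RSS] = sum of var_i*(1 - h_i) are standard results for simple linear regression.

```python
X = [1, 1, 1, 1, 5, 5, 9, 9, 9, 9]   # same design as the MINITAB session


def slope_variances(x):
    """Return (true variance of the OLS slope, expected pooled estimate)."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    var_e = [xi ** 2 for xi in x]    # assumption: error SD equals X
    # Actual sampling variance of the slope under unequal error variances.
    true_var = sum((xi - xbar) ** 2 * v for xi, v in zip(x, var_e)) / sxx ** 2
    # Leverage of each observation in simple linear regression.
    lev = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]
    # Expected residual mean square, then the usual pooled formula s^2 / Sxx.
    exp_s2 = sum(v * (1 - h) for v, h in zip(var_e, lev)) / (n - 2)
    naive_var = exp_s2 / sxx
    return true_var, naive_var


if __name__ == "__main__":
    true_var, naive_var = slope_variances(X)
    print(true_var, naive_var)
```

For this design the two values come out at about 0.320 and 0.292, so the pooled formula understates the slope's variance on average. That comparison covers only the average; the pooled estimate is also dominated by the few noisy observations at X=5 and X=9, which is a separate reason the nominal t reference distribution with 8 degrees of freedom may be too optimistic.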

        -Robert Dawson





