(Posted to edstat only, given the querent's spam-blocking address.)

The assumptions necessary for your understanding to be correct (as
you've described it, anyway) are rather seldom met in practice.

On Sat, 18 Oct 2003, Oliver Dain <[EMAIL PROTECTED]> wrote:

> I think I understand this but I wanted to check: performing a
> regression by minimizing sum of squared errors produces a curve that
> goes through the mean of the dependent variable at each value of the
> independent variable (e.g. if my regression produces y=f(x) then
> f(x)=E[Y|X=x]).

ONLY IF the regression model is in fact true.  (One cannot in general
know this.)  Even then, the conclusion you state may apply to the
population from which your data were putatively drawn, but (because of
sampling variations) not to your particular data set.
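
A minimal sketch of the first caveat, assuming invented data and a
hypothetical helper named ols_line (neither comes from your problem):
the conditional means below follow x**2, but a straight line is fitted
by least squares anyway, and the fitted line then misses the
conditional mean at every value of X.

```python
# Hypothetical data: two observations at each x, centred on x**2,
# so the sample conditional mean at each x is exactly x**2.
xs = [x for x in range(5) for _ in range(2)]
ys = [x * x + d for x in range(5) for d in (-1, 1)]


def ols_line(xs, ys):
    """Return (intercept, slope) minimizing the sum of squared errors."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


intercept, slope = ols_line(xs, ys)

# Sample conditional mean of y at each distinct x.
cond_mean = {}
for x in sorted(set(xs)):
    group = [y for xi, y in zip(xs, ys) if xi == x]
    cond_mean[x] = sum(group) / len(group)

# Gap between the fitted straight line and each conditional mean.
gaps = {x: intercept + slope * x - m for x, m in cond_mean.items()}
for x, gap in gaps.items():
    print(f"x={x}: fitted - conditional mean = {gap:+.2f}")
```

The line is the best straight line in the least-squares sense, but
because the straight-line model is wrong for these data, none of the
gaps is zero.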

> Performing a regression by minimizing the sum of absolute errors
> results in a curve that goes through the median value of the dependent
> variable.

Presumably you mean "median values" (plural) and are referring to the
conditional medians for each value of X.  Same proviso as above.  But
there's an additional point, not commonly realized:  the curve in
question may not be unique.  For a simple example, consider these data:
   X = 1, 1, 5, 5;  Y = 2, 5, 4, 7;  respectively.  A plot of Y vs. X
shows a parallelogram with vertical sides whose four corners are these
four points.  Lines that minimize the sum of absolute errors in Y
include, but are not limited to (there are infinitely many of 'em) one
that goes through the two points (1,5) and (5,4) (a line of negative
slope);  another that goes through (1,4.5) and (5,4.5) (a horizontal
line);  another that goes through (1,2) and (5,7) (a line of positive
slope).  In fact any line that stays between the two data values at
X = 1 and between the two data values at X = 5 gives the same sum.
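
The arithmetic for the three lines just named can be checked with a
short sketch (the helper names line_through and sum_abs_errors are
made up for illustration, and the fourth "outside" line is my own
addition for contrast):

```python
# The four data points from the example.
points = [(1, 2), (1, 5), (5, 4), (5, 7)]


def line_through(p, q):
    """Return the function x -> y for the line through points p and q."""
    (x1, y1), (x2, y2) = p, q
    slope = (y2 - y1) / (x2 - x1)
    return lambda x: y1 + slope * (x - x1)


def sum_abs_errors(f):
    """Sum of absolute vertical errors of line f over the data."""
    return sum(abs(y - f(x)) for x, y in points)


candidates = {
    "negative slope": line_through((1, 5), (5, 4)),
    "horizontal": line_through((1, 4.5), (5, 4.5)),
    "positive slope": line_through((1, 2), (5, 7)),
    # For contrast: a line that passes above both data values at X = 1.
    "outside": line_through((1, 6), (5, 4.5)),
}

sae = {name: sum_abs_errors(f) for name, f in candidates.items()}
for name, value in sae.items():
    print(f"{name:15s} sum of absolute errors = {value}")
```

The first three lines tie at 6, while the line that leaves the
parallelogram at X = 1 does worse (8).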
  Of course one can argue that this is a consequence of the fact that
the conditional medians are not unique, where "median" is defined as a
number that minimizes the sum of absolute errors, with no other side
conditions.  (Computer programs usually are set up to impose enough of a
side condition to produce a unique result:  e.g., when a median is found
to lie between two values in the data, the conventional procedure is to
select the midpoint between those values;  but this convention is not
really part of the definition of a median (in the sense in which you
appear to want to use it), and is used only for the convenience of
reporting one value rather than a range of values.  Sometimes the range
would be more informative...)
  A least-squares criterion, on the other hand, produces a unique
result:  for the mean always, and for the regression parameters whenever
the predictors are not collinear.
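
The univariate version of this contrast can be sketched with the two Y
values observed at X = 1 (the trial values of c are arbitrary choices of
mine;  the statistics module is Python's standard library):

```python
import statistics

ys = [2, 5]  # the two Y values observed at X = 1


def sum_abs(c):
    """Sum of absolute deviations of the data from c."""
    return sum(abs(y - c) for y in ys)


def sum_sq(c):
    """Sum of squared deviations of the data from c."""
    return sum((y - c) ** 2 for y in ys)


trial = [2, 2.75, 3.5, 4.25, 5]  # arbitrary values between the two data points
abs_losses = [sum_abs(c) for c in trial]  # identical for every trial value
sq_losses = [sum_sq(c) for c in trial]    # smallest only at the mean, 3.5

# The midpoint convention, as implemented by statistics.median.
midpoint = statistics.median(ys)

print("absolute:", abs_losses)
print("squared: ", sq_losses)
print("median by the midpoint convention:", midpoint)
```

Every trial value between 2 and 5 gives the same sum of absolute
deviations, so each is a median in the sense used above;  the sum of
squares singles out 3.5 alone.  Python also offers
statistics.median_low and statistics.median_high, which report the two
ends of the range instead of its midpoint.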

> If this is correct then an absolute error and a mean error regression
> should produce exactly the same results if the errors are normally
> distributed (as the mean and the median are the same) given a suitably
> large sample.

This may be true with respect to distribution theory, and for a
population.  However, sample data are NEVER exactly normally distributed
(only, at best, approximately so);  and the errors can look
approximately normal only if the functional model (whose parameters one
is estimating via regression) is correct;  which, as remarked above, you
cannot know in advance.

> However, while both absolute and sum of squared regressions produce an
> unbiased estimate of the mean (assuming normally distributed errors)
> the squared error approach produces a more efficient estimate (in
> other words both will converge to the mean as sample size is increased
> but the squared error estimate will converge more quickly).

Here you speak of estimating a mean (or median), and of efficiency in
estimating that mean (or median).  But if THAT's all you want to do, why
go to the bother of regression analysis?  The usual univariate estimates
(sample mean, sample median) will do as well.  A regression, of whatever
stripe, is used to estimate the parameters of a mathematical model
relating the response variable to the predictor(s).  For me to try to
interpret your paragraph in a way that would make sense would be for me
to put words in your mouth.  Better for you to do that, especially as
doing so may help clarify your thinking on all this.
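
For the purely univariate case, the efficiency comparison can at least
be checked by simulation.  This is only a sketch under assumptions of my
own choosing (standard normal data, samples of size 25, 2000
replications, an arbitrary seed);  it says nothing about any particular
regression model.

```python
import random
import statistics

random.seed(0)  # arbitrary seed, used only so the run is repeatable

n, reps = 25, 2000  # assumed sample size and number of replications
means, medians = [], []
for _ in range(reps):
    sample = [random.gauss(0.0, 1.0) for _ in range(n)]
    means.append(statistics.mean(sample))
    medians.append(statistics.median(sample))

# Sampling variance of each estimator across the replications.
var_mean = statistics.pvariance(means)
var_median = statistics.pvariance(medians)

# Relative efficiency of the median with respect to the mean;
# large-sample theory for normal data puts this near 2/pi (about 0.64).
ratio = var_mean / var_median

print(f"variance of sample mean:   {var_mean:.4f}")
print(f"variance of sample median: {var_median:.4f}")
print(f"relative efficiency:       {ratio:.2f}")
```

Both estimators centre on the same value here, but the sample median
varies more from sample to sample than the sample mean does, which is
the sense in which the mean is the more efficient estimate for normal
data.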

> Do I have any of this right?

Some, I think;  but it's not altogether clear which part(s), in the
absence of any contextual information.

 -----------------------------------------------------------------------
 Donald F. Burrill                                         [EMAIL PROTECTED]
 56 Sebbins Pond Drive, Bedford, NH 03110                 (603) 626-0816
