(Posted to edstat only, given the querent's spam-blocking address.) The assumptions necessary for your understanding to be correct (as you've described it, anyway) are rather seldom met in practice.
On Sat, 18 Oct 2003, Oliver Dain <[EMAIL PROTECTED]> wrote:

> I think I understand this but I wanted to check: performing a
> regression by minimizing sum of squared errors produces a curve that
> goes through the mean of the dependent variable at each value of the
> independent variable (e.g. if my regression produces y=f(x) then
> f(x)=E[Y|X=x]).

ONLY IF the regression model is in fact true. (One cannot in general know this.) Even then, the conclusion you state may apply to the population from which your data were putatively drawn, but (because of sampling variation) not to your particular data set.

> Performing a regression by minimizing the sum of absolute errors
> results in a curve that goes through the median value of the dependent
> variable.

Presumably you mean "median values" (plural) and are referring to the conditional medians for each value of X. Same proviso as above. But there's an additional point, not commonly realized: the curve in question may not be unique. For a simple example, consider these data:

    X = 1, 1, 5, 5
    Y = 2, 5, 4, 7

A plot of Y vs. X shows a parallelogram with vertical sides whose four corners are these four points. Lines that minimize the sum of absolute errors in Y include, but are not limited to (there's an infinite number of 'em), one that goes through the two points (1,5) and (5,4) (a line of negative slope); another that goes through (1,4.5) and (5,4.5) (a horizontal line); and another that goes through (1,2) and (5,7) (a line of positive slope). Each of these gives a total absolute error of 6, as does any other line that passes between the two data points at X = 1 and between the two at X = 5.

Of course one can argue that this is a consequence of the fact that the conditional medians are not unique, where "median" is defined as a number that minimizes the sum of absolute errors, with no other side conditions.

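The non-uniqueness in the four-point example is easy to check numerically. What follows is only a minimal Python sketch: the function name `sum_abs_errors` and the extra comparison line through (1,6) and (5,4) are illustrative choices, not part of the original exchange. It evaluates the sum of absolute errors for the three lines named above.

```python
# Data from the example: two observations at X = 1, two at X = 5.
xs = [1, 1, 5, 5]
ys = [2, 5, 4, 7]


def sum_abs_errors(p, q):
    """Sum of |y - f(x)| over the data, for the line f through points p and q."""
    (x0, y0), (x1, y1) = p, q
    slope = (y1 - y0) / (x1 - x0)
    return sum(abs(y - (y0 + slope * (x - x0))) for x, y in zip(xs, ys))


# The three lines mentioned in the text, each given by two points on it.
candidates = {
    "negative slope": ((1, 5), (5, 4)),
    "horizontal": ((1, 4.5), (5, 4.5)),
    "positive slope": ((1, 2), (5, 7)),
}
totals = {name: sum_abs_errors(p, q) for name, (p, q) in candidates.items()}

# Hypothetical comparison line (not from the post): it passes above both
# data points at X = 1, so it should do worse than the three candidates.
outside = sum_abs_errors((1, 6), (5, 4))

for name, total in totals.items():
    print(f"{name}: {total}")
print(f"comparison line outside the parallelogram: {outside}")
```

All three candidate lines tie, while the comparison line has a larger total, which is the sense in which the minimizing line is not unique here.
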
(Computer programs usually are set up to impose enough of a side condition to produce a unique result: e.g., when a median is found to lie between two values in the data, the conventional procedure is to select the midpoint between those values. But this convention is not really part of the definition of a median (in the sense in which you appear to want to use it), and is used only for the convenience of reporting one value rather than a range of values. Sometimes the range would be more informative...)

A least-squares criterion, on the other hand, produces a unique result: always for the mean, and for the regression parameters whenever the predictors are not perfectly collinear.

> If this is correct then an absolute error and a mean error regression
> should produce exactly the same results if the errors are normally
> distributed (as the mean and the median are the same) given a suitably
> large sample.

This may be true with respect to distribution theory, and for a population. However, sample data are NEVER normally distributed (except sometimes approximately); and if they are, it is only because the functional model (whose parameters one is estimating via regression) is correct; which, as remarked above, you cannot know in advance.

> However, while both absolute and sum of squared regressions produce an
> unbiased estimate of the mean (assuming normally distributed errors)
> the squared error approach produces a more efficient estimate (in
> other words both will converge to the mean as sample size is increased
> but the squared error estimate will converge more quickly).

Here you speak of estimating a mean (or median), and of efficiency in estimating that mean (or median). But if THAT's all you want to do, why go to the bother of regression analysis? The usual univariate estimates (sample mean, sample median) will do as well. A regression, of whatever stripe, is used to estimate the parameters of a mathematical model relating the response variable to the predictor(s).

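The efficiency claim in the quoted paragraph, read as a statement about the sample mean and sample median of normally distributed data, can be illustrated by simulation. This is a rough Python sketch only; the function name `sampling_variances`, the sample size, the number of replications, and the seed are all arbitrary choices made for illustration.

```python
import random
import statistics


def sampling_variances(n=101, reps=2000, seed=0):
    """Simulate the sampling variance of the mean and of the median.

    Draws `reps` samples of size `n` from a standard normal distribution
    and returns (variance of the sample means, variance of the sample medians).
    """
    rng = random.Random(seed)
    means, medians = [], []
    for _ in range(reps):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        means.append(statistics.fmean(sample))
        medians.append(statistics.median(sample))
    return statistics.pvariance(means), statistics.pvariance(medians)


var_mean, var_median = sampling_variances()
print("variance of sample mean:  ", var_mean)
print("variance of sample median:", var_median)
print("ratio (mean / median):    ", var_mean / var_median)
```

Under normality the printed ratio should come out somewhere near 2/pi (about 0.64), the large-sample relative efficiency of the median. Both estimators centre on the same value, but the median is the noisier of the two.
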
For me to try to interpret your paragraph in a way that would make sense would be for me to put words in your mouth. Better for you to do that, especially as doing so may help clarify your thought on all this.

> Do I have any of this right?

Some, I think; but it's not altogether clear which part(s), in the absence of any contextual information.

-----------------------------------------------------------------------
Donald F. Burrill    [EMAIL PROTECTED]
56 Sebbins Pond Drive, Bedford, NH 03110    (603) 626-0816
