Hi Paige, Comments below:
"> > The data above present one half of a roughly bell shaped frequency > > distribution. It is abundantly clear that the reduction of cell sizes > > reduces the power of the statistics. This fact is also supported by those > > graphs from regression analysis that show the standard error increases as > > the values of the predictor are more extreme. > > I didn't follow this last sentence. What graphs? What standard error? This is pretty standard stuff. For example, on page57 of Kleinbaum and Kuppers book Applied Regressions Analysis..., confidence bands are graphically displayed for a regression model. The bands get wider towards the ends of the regression slope, thus illustrating wider variation in the extremes. The width of the confidence band is a function of the standard error of the estimated y value at each level of the predictor variable. I remember as a student asking Jamie Algina why this occured but did not get an answer, and have not heard one since. Perhaps the band gets wider when the predictor (x) is normally distributed but not when x is uniformly distributed? > > > All of this suggests to me that when ever there is a serious desire to infer > > causation from correlational data, it is reasonable to seek out uniformly > > sampled putative causes. > > This would be ideal, and can be done in designed studies, however many > studies are not really "designed", the data is collected and you have to > live with whatever sample sizes occur. But should you get paid to infer causation from samples that are not sufficient to warrant that inference. Again we come back to practicality versus integrity. > > There was a recent discussion in one of these stat newsgroups about > inferring causation from correlation data. I note that people fall on > both sides of the argument, however, my position is that without subject > matter knowledge, you cannot get to causation, you only have correlation. You are speaking as an authority figures expressing an opinion. 
I have been working on and publishing about the inference of causation from correlations since 1985, and I am afraid mere opinion does not get very far. Why do you think it is impossible, other than that you have been told by your teachers (who apparently also stress practicality) that it is impossible?

> > The problem with using corresponding regressions with normally
> > distributed causes is that there is not enough information in the
> > extremes to reveal the polarization effect. We see that data degradation
> > also occurs in the simplest ANOVA designs when the factors are sampled
> > normally. This confirms the unity of the general linear model.
>
> I have no idea what polarization means, nor do I understand the term
> "factors are sampled normally". I do not understand "unity of the
> general linear model".

Forgive me, I thought perhaps you had been following the arguments on corresponding regressions. The general linear model is a model in statistics that integrates both the correlational and ANOVA traditions into a unified set of calculations. I mention it because if we subscribe to the general linear model, then the assumptions we hold for ANOVA should apply to correlation as well. By "factors are sampled normally" I mean that some idiot goes out and purposefully collects smaller numbers of observations towards the ends of an ANOVA factor and many towards the middle ranges of the factor. Thus, the cell sizes of the factor will be approximately normally distributed. We would ordinarily frown upon someone doing such a thing in a designed experiment, but we think nothing of the same sort of sampling occurring in correlational studies.

> > I understand your point that the normality assumption applies to the
> > dependent variable, at least when F or t are being calculated.
>
> The normality assumption applies to the errors in the dependent
> variable, not the dependent variable itself.

Interesting. So why do so many people prefer normally distributed variables?
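On the "unified set of calculations" point: a one-way ANOVA comparison of two cell means is literally a regression of y on a 0/1 dummy variable. A small sketch (the group data are toy numbers of my own):

```python
# Sketch: one-way ANOVA as a regression in the general linear model.
# The two groups and their y values are made-up, purely for illustration.
groups = {"A": [4.0, 5.0, 6.0], "B": [8.0, 9.0, 10.0]}

# Dummy-code group membership: 0 for A, 1 for B.
x = [0.0] * len(groups["A"]) + [1.0] * len(groups["B"])
y = groups["A"] + groups["B"]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar

mean_a = sum(groups["A"]) / len(groups["A"])
mean_b = sum(groups["B"]) / len(groups["B"])

# Least-squares regression recovers the cell means:
# intercept = mean of A, slope = (mean of B) - (mean of A).
print(b0, b1)  # -> 5.0 4.0
```

The intercept reproduces the mean of group A and the slope the B-minus-A mean difference, which is exactly what the ANOVA estimates: the same least-squares machinery serves both traditions.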
> > But if y values in the extremes of x have a wider dispersion, and hence
> > greater error when the cell sizes are normally distributed, it would
> > seem that uniformity in the x factor would be the ideal. When we
> > calculate the difference between y means across the levels of x, if the
> > underlying variances are not identical, then different standard errors
> > should be assumed per mean. This complicates the ANOVA design and the
> > pooling of error variances. Think about unequal variances in the t-test.
> >
> > It may be true that the linear slope calculated on y from x is
> > legitimately extrapolated across the ranges of y. But the pattern of
> > deviations about that slope is not uniform, and thus the inferences of
> > the points along y are not based on uniform parameters. I believe this
> > is a well-established fact. Statistics that require more than
> > theoretically extrapolated slopes are thus compromised by unequal cell
> > sizes.
>
> Your argument seems to rely on assumptions you make that are not
> universally true. "The pattern of deviations about that slope is not
> uniform ..." I have many industrial examples where the pattern of
> deviations is uniform, regardless of the value of X.

So you do know what I mean above when I talk about standard errors, etc.! OK, look at the data you mention. Then look at the data in Kleinbaum and Kupper. Are the uniform confidence bands you see in your data derived from designs in which the factors/predictors are sampled uniformly across the levels?

> > My conclusion from all of this is that where SEM users have hypotheses,
> > they would best spend the extra time and money uniformly sampling their
> > putative causes, so as to better represent the causal model empirically.
>
> Well, now you drag in SEM ... you are really stretching to make a point,
> aren't you? SEM is often done on data that is collected based upon
> historical studies, where uniform sampling simply isn't possible. What
> is your point?
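Before replying, let me put the cell-size point in numbers. Even with a perfectly constant within-cell variance, the standard error of a cell mean is sigma / sqrt(n), so piling observations into the middle levels leaves the extreme cell means poorly estimated. A sketch with made-up cell sizes (total N = 50 in both designs):

```python
import math

sigma = 1.0  # assume a common within-cell standard deviation

# Hypothetical cell sizes when sampling piles up in the middle levels
# ("normally sampled" factor) versus a uniform design with the same N.
normalish = {"low": 3, "mid-low": 10, "mid": 24, "mid-high": 10, "high": 3}
uniform = {level: 10 for level in normalish}

def se_of_cell_mean(n_cell):
    """Standard error of a cell mean based on n_cell observations."""
    return sigma / math.sqrt(n_cell)

for level in normalish:
    print(level,
          round(se_of_cell_mean(normalish[level]), 3),  # bell-shaped design
          round(se_of_cell_mean(uniform[level]), 3))    # uniform design
```

In this toy comparison the extreme cells of the bell-shaped design carry about sqrt(10/3), roughly 1.8 times, the standard error of their uniform counterparts: the precision loss at the extremes argued above.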
You seem to believe you have a superior intellect and training. What has really happened is that I am using your insatiable need to show off your knowledge of statistics to illustrate a point. That point is that unequal cell sizes create differences in the precision of estimates. Using unequal cell sizes, as is the habit of correlational and SEM people of practical inclination, builds serious problems into the statistics. Because of people's near-mystical attachment to normal distributions, not only have many statistical studies been compromised, but future developments in causal inference are being obstructed. So, my wonderful colleague, I will drag the truth into any conversation I enter, without hesitation or apology. Dragging is what one must do when dealing with reticent "professionals" who collude in incompetence and ignorance.

> > Do you agree?
>
> I don't agree, I don't disagree; to put it simply, I don't follow your
> argument.

Would you create an ANOVA design with cell sizes that are normally distributed across the levels of your factors? If so, why?

Best,
Bill

> --
> Paige Miller
> [EMAIL PROTECTED]
> http://www.kodak.com
>
> "It's nothing until I call it!" -- Bill Klem, NL Umpire
> "When you get the choice to sit it out or dance, I hope you dance" --
> Lee Ann Womack

=================================================================
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at:
http://jse.stat.ncsu.edu/
=================================================================
