If the mean differences are large enough to be interesting, then the variance component for time will be larger. After all, it is just a variant of ANOVA. The t-test, on the other hand, is also a function of sample size; the variance component is more of an effect size measure. I would argue that for training purposes you would want to know the results for individual raters, so I don't recommend ICCs for that purpose. They are best for assessing the overall impact of multiple sources of error on the measurement process. If I have a paper in which I wish to report the reliability of my measure, I don't want to have to report a potful of correlations and t-tests; I want a single value that tells me the impact of all sources of error on the outcome measured.
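To put that contrast in symbols (these are the standard textbook forms, not anything computed in this thread): the paired t is dbar / (s_d / sqrt(n)), so it grows with sqrt(n) even when the mean difference dbar is trivially small, whereas the ICC, sigma^2(subjects) / (sigma^2(subjects) + sigma^2(time) + sigma^2(error)), contains no n at all.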
Paul R. Swank, Ph.D.
Professor, Developmental Pediatrics
Medical School
UT Health Science Center at Houston

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Richard Ulrich
Sent: Monday, May 17, 2004 8:57 AM
To: [EMAIL PROTECTED]
Subject: Re: [edstat] paired t-test for test-retest reliability reference?

[I'm top-posting a couple of comments, and deleting most of my own post that was cited.]

You seem to make one of my points -- that the popular ICCs will cover up mean differences, which might or might not be interesting. They may also cover up a single poorly correlated rater among multiple raters. That can be good for planning for new raters, but it is not so good for training raters or for reporting results in full.

On 13 May 2004 07:18:47 -0700, [EMAIL PROTECTED] (Paul R Swank) wrote:

> First, let's consider the two-observation case. I have two assessments
> of a behavior rating taken 20 minutes apart, and I wish to know how
> reliable the assessments are. There are two potential sources of error:
> the relative error over time, in which the order of scores for subject
> A and subject B on the two assessments may be the same or different,
> and the absolute error, in which all subjects may score lower on the
> second assessment. If I do a Pearson correlation between the two, I
> find a correlation of .78097 (n = 313, p < .0001). I do an analysis of
> variance with repeated measures on time (the equivalent of the paired
> t-test) and find a significant difference between the means (time 1:
> mean = 3.377, sd = 1.10; time 2: mean = 3.291, sd = 1.16;
> F(1, 312) = 4.16; p = .0422). Now I do a generalizability analysis and
> find the following variance components:
>
>   Subjects           .99269
>   Time               .00300
>   Subjects by Time   .27842
>
> The generalizability coefficient (or ICC) considering only the
> relative error (the interaction) is
>
>   .99269 / (.99269 + .27842) = .99269 / 1.27111 = .78096,
>
> which is the Pearson correlation, within rounding. I then figure the
> coefficient taking into account the mean difference as well:
>
>   .99269 / (.99269 + .00300 + .27842) = .99269 / 1.27411 = .779.
>
> The mean difference has had a minimal effect on the reliability, as
> should be obvious from the variance component for time, which is very
> small relative to the other variance components.
>
> Thus, even though the difference between time 1 and time 2 is
> significant (due in part to the large sample and the strong correlation
> between two observations taken 20 minutes apart), the effect on the
> reliability is small. Of course, I could observe that from the means as
> well, since they are very close, but when people see two means, many of
> them want to know whether the means are statistically different.
>
> Add to this the fact that, because in reality I have five assessments
> of the observed variable over an hour's time, the generalizability
> result is much easier to deal with than 10 unique Pearson correlations
> and an ANOVA (hopefully not 10 paired t-tests), and it becomes clear
> that the generalizability analysis is cleaner than breaking the
> analysis into two parts.
> [snip sig.]
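For readers who want to reproduce the arithmetic, here is a minimal sketch of the two-way variance-component decomposition behind those coefficients. It is not the original analysis (the post does not say what software was used), and the data below are simulated from made-up parameters, so the outputs will only be in the neighborhood of the .781/.779 values quoted above.

import numpy as np

rng = np.random.default_rng(0)
n, k = 313, 2                                # subjects, occasions

# hypothetical scores: subject effect + small downward shift at time 2 + noise
subj = rng.normal(3.3, 1.0, size=(n, 1))
shift = np.array([0.0, -0.09])
y = subj + shift + rng.normal(0.0, 0.5, size=(n, k))

# mean squares from the subjects-by-time ANOVA (one observation per cell)
grand = y.mean()
ms_subj = k * ((y.mean(axis=1) - grand) ** 2).sum() / (n - 1)
ms_time = n * ((y.mean(axis=0) - grand) ** 2).sum() / (k - 1)
resid = y - y.mean(axis=1, keepdims=True) - y.mean(axis=0, keepdims=True) + grand
ms_resid = (resid ** 2).sum() / ((n - 1) * (k - 1))

# solve the expected mean squares for the variance components
var_subj = (ms_subj - ms_resid) / k
var_time = max((ms_time - ms_resid) / n, 0.0)
var_resid = ms_resid

icc_consistency = var_subj / (var_subj + var_resid)            # tracks Pearson r
icc_agreement = var_subj / (var_subj + var_time + var_resid)   # G coefficient
print(round(icc_consistency, 3), round(icc_agreement, 3))

The consistency coefficient leaves the time component out of the denominator, which is exactly why it tracks the Pearson r, while the agreement coefficient charges the mean shift against the reliability.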
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
> On Behalf Of Richard Ulrich
> Sent: Wednesday, May 12, 2004 2:52 PM

[snip, his and mine]

RU > Yes, it is the overall impact, and that can be useful for the
> *final* statement, especially when a very precise statement of
> overall impact is warranted -- because, for instance, power analyses
> are being based on the exact value of the exact form of ICC that is
> needed: same versus different raters; single versus multiple scorers.
>
> And I think it is an over-generalization to prefer an ICC when the
> issue is the cruder one of apparent adequacy. The ICC is less
> informative (about means) and less transparent (multiple versions
> available to select, all of them burying the means).
>
> [snip, rest]

--
Rich Ulrich, [EMAIL PROTECTED]
http://www.pitt.edu/~wpilib/index.html
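A footnote on the single-versus-multiple-scorers distinction: the two forms are linked by the Spearman-Brown step, a textbook relation that is not computed anywhere in this thread: ICC_k = k * ICC_1 / (1 + (k - 1) * ICC_1). With the single-occasion value of about .78 from the example above, the average of the two occasions would come out to roughly 2(.78) / 1.78 = .88.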