If the mean differences are large enough to be interesting, then the
time variance component will be larger as well. After all, it is just a
variant of ANOVA. The t-test, on the other hand, is also a function of
sample size; the variance component is more of an effect-size measure.
I would argue that for training purposes you would want to know the
results for individual raters, so I don't recommend ICCs for that
purpose. They are best for assessing the overall impact of multiple
sources of error on the measurement process. If I have a paper in which
I wish to convey the reliability of my measure, I don't want to have to
report a potful of correlations and t-tests; I want a single value that
tells me the impact of all sources of error on the outcome measured.
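
To make that concrete with the numbers from the exchange below, here is
a minimal Python sketch (the variable names are mine, not part of the
original analysis) that turns the three reported variance components
into the two generalizability coefficients:

    # Variance components from the repeated-measures example quoted below
    var_subjects = 0.99269   # true-score (between-subjects) variance
    var_time     = 0.00300   # absolute error: overall shift between occasions
    var_interact = 0.27842   # relative error: subjects-by-time interaction

    # Relative G (consistency): ignores the mean shift over occasions
    g_relative = var_subjects / (var_subjects + var_interact)

    # Absolute G (agreement): charges the mean shift against reliability
    g_absolute = var_subjects / (var_subjects + var_time + var_interact)

    print(f"{g_relative:.5f} {g_absolute:.3f}")   # 0.78096 0.779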

Paul R. Swank, Ph.D. 
Professor, Developmental Pediatrics
Medical School
UT Health Science Center at Houston 


-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On
Behalf Of Richard Ulrich
Sent: Monday, May 17, 2004 8:57 AM
To: [EMAIL PROTECTED]
Subject: Re: [edstat] paired t-test for test-retest reliability reference?


[I'm top-posting a couple of comments, and deleting most of my own post that
was cited.]

You seem to make one of my points -- that the popular ICCs will cover
up mean differences, which might or might not be interesting. They may
also cover up a single poorly correlated rater among multiple raters.
That can be good for planning for new raters, but it is not so good for
training raters or for reporting results in full.
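
[As a quick way to see what a pooled coefficient can hide, here is a
minimal Python sketch -- the data layout and function name are
illustrative, not from any analysis in this thread -- that checks each
rater against the mean of the others:

    import numpy as np

    def rater_check(ratings):
        """ratings: n_subjects x n_raters array of scores."""
        n_subj, n_raters = ratings.shape
        for j in range(n_raters):
            # correlate rater j with the mean of all remaining raters
            others = np.delete(ratings, j, axis=1).mean(axis=1)
            r = np.corrcoef(ratings[:, j], others)[0, 1]
            print(f"rater {j}: r with mean of others = {r:.3f}")

A single low value here flags the poorly correlated rater that a pooled
ICC would average away.]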


On 13 May 2004 07:18:47 -0700, [EMAIL PROTECTED] (Paul R Swank)
wrote:

> First, let's consider the 2-observation case. I have 2 assessments of
> a behavior rating taken 20 minutes apart; I wish to know how reliable
> the assessments are. There are two potential sources of error: the
> relative error over time, in which the order of scores for subject a
> and subject b on the two assessments may be the same or different; and
> the absolute error, in which all subjects may score lower on the second
> assessment. If I do a Pearson correlation between the two, I find a 
> correlation of .78097 (n=313, p < .0001). I do an analysis of variance 
> with repeated measures on time (the equivalent of the paired t-test)
> and find a significant difference between the means (time 1: mean =
> 3.377, sd = 1.10; time 2: mean = 3.291, sd = 1.16; F(1, 312) = 4.16;
> p = .0422).
> Now, I do a generalizability analysis. I find the following variance 
> components:
> 
> Subjects                              .99269
> Time                                  .00300
> Subjects by Time                      .27842
> 
> The generalizability coefficient (or ICC) considering only the 
> relative error (interaction) is
> 
> .99269 / (.99269 + .27842) = .99269 / 1.27111 = .78096, which is the
> Pearson correlation within rounding. I then figure the coefficient
> taking into account the mean difference as well.
> 
> .99269 / (.99269 + .003 + .27842) = .99269 / 1.27411 = .779.
> 
> The mean difference has had a minimal effect on the reliability, as
> should be obvious from the variance component for time, which is very
> small relative to the other variance components.
> 
> Thus, even though the difference between time 1 and time 2 is
> significant (due in part to the large sample and the strong correlation
> between two observations taken 20 minutes apart), the effect on the
> reliability is small. Of course, I could observe that in the means as
> well, since they are very close, but when you see two means, many
> people want to know whether they are statistically different.
> 
> Add to this the fact that, because in reality I have 5 assessments of
> the observed variable over an hour's time, the generalizability result
> is much easier to deal with than 10 unique Pearson correlations and an
> ANOVA (hopefully not 10 paired t-tests), and it becomes clear that the
> generalizability analysis is cleaner than breaking the analysis into
> two parts.
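
[To make the 5-assessment point concrete: a minimal Python sketch, on a
hypothetical subjects-by-occasions score matrix, of the 10 pairwise
Pearson correlations alongside the single absolute G coefficient that
replaces them. The function name and layout are illustrative only.

    import itertools
    import numpy as np

    def pairwise_vs_g(scores):
        """scores: n_subjects x k_occasions array (here k = 5)."""
        n, k = scores.shape
        # the k*(k-1)/2 = 10 pairwise Pearson correlations
        pairs = {(a, b): np.corrcoef(scores[:, a], scores[:, b])[0, 1]
                 for a, b in itertools.combinations(range(k), 2)}
        # two-way ANOVA mean squares -> variance components
        grand = scores.mean()
        ms_subj = k * ((scores.mean(axis=1) - grand) ** 2).sum() / (n - 1)
        ms_time = n * ((scores.mean(axis=0) - grand) ** 2).sum() / (k - 1)
        ss_resid = (((scores - grand) ** 2).sum()
                    - (n - 1) * ms_subj - (k - 1) * ms_time)
        ms_resid = ss_resid / ((n - 1) * (k - 1))
        var_subj = max((ms_subj - ms_resid) / k, 0.0)  # true-score variance
        var_time = max((ms_time - ms_resid) / n, 0.0)  # absolute error
        g_abs = var_subj / (var_subj + var_time + ms_resid)
        return pairs, g_abs

One call yields both the component-level picture and the single summary
value.]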
> 
[snip sig.]
> 
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
> On Behalf Of Richard Ulrich
> Sent: Wednesday, May 12, 2004 2:52 PM

[snip, his and mine]
RU >
> Yes, it is the overall impact, and that can be useful for the
> *final* statement, especially when a very precise statement of
> overall impact is warranted -- because, for instance, power analyses
> are being based on the exact value of the exact form of ICC that is
> needed: same versus different raters; single versus multiple scorers.
> 
> And I think it is an over-generalization to prefer an ICC when the
> issue is the cruder one of apparent adequacy. The ICC is less
> informative (about means) and less transparent (multiple versions
> available to select from, all of them burying the means).
> 
> [snip, rest]

-- 
Rich Ulrich, [EMAIL PROTECTED] http://www.pitt.edu/~wpilib/index.html
.
. =================================================================
Instructions for joining and leaving this list, remarks about the problem of
INAPPROPRIATE MESSAGES, and archives are available at:
.                  http://jse.stat.ncsu.edu/                    .
=================================================================
