On Tue, 28 Dec 1999 09:24:03 -0500, Bruce Weaver
<[EMAIL PROTECTED]> wrote:
[ snip, question ]
> It sounds like you have two related kappas here. Here are two papers you
> could consult, if that is the case:
>
> McKenzie DP et al. (1996). Comparing correlated kappas by resampling: Is
> one level of agreement significantly different from another? Journal of
> Psychiatric Research, 30, 483-492.
>
> McKenzie DP, Mackinnon AJ, Clarke DM. (1997). KAPCOM: A program for the
> comparison of kappa coefficients obtained from the same sample of
> observations. Perceptual & Motor Skills, 85, 899-902.
...
I am saving the reference, but, having read the paper, I have
doubts that I will ever use it.
The JPR paper shows bootstrap and Monte Carlo results for a single
example, and the demonstrated robustness is "satisfactory" on a scale
of [poor, satisfactory, good], with simulated test sizes of 4.0% to
6.0% for a nominal 5% test.
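For anyone who wants to try the resampling idea without KAPCOM, the
general scheme is to resample subjects with replacement, recompute both
kappas against the criterion on each resample, and look at the bootstrap
distribution of the difference. A minimal sketch in Python (my own
illustration, not the authors' program; the function names and the 0/1
coding of diagnoses are assumptions):

    import numpy as np

    def cohen_kappa(rating, criterion):
        # Cohen's kappa for two 0/1 vectors: (observed - chance) / (1 - chance).
        rating, criterion = np.asarray(rating), np.asarray(criterion)
        p_obs = np.mean(rating == criterion)
        p_chance = (np.mean(rating) * np.mean(criterion)
                    + np.mean(1 - rating) * np.mean(1 - criterion))
        return (p_obs - p_chance) / (1 - p_chance) if p_chance < 1 else 1.0

    def bootstrap_kappa_diff(criterion, scale_g, scale_b, n_boot=5000, seed=0):
        # Resample whole subjects so the two kappas stay correlated, and
        # collect the bootstrap distribution of their difference.
        criterion = np.asarray(criterion)
        scale_g, scale_b = np.asarray(scale_g), np.asarray(scale_b)
        rng = np.random.default_rng(seed)
        n = len(criterion)
        diffs = np.empty(n_boot)
        for i in range(n_boot):
            idx = rng.integers(0, n, size=n)
            diffs[i] = (cohen_kappa(scale_g[idx], criterion[idx])
                        - cohen_kappa(scale_b[idx], criterion[idx]))
        return diffs

    # A 95% percentile interval for the difference that excludes zero is the
    # bootstrap analogue of declaring the two kappas significantly different:
    #   lo, hi = np.percentile(bootstrap_kappa_diff(truth, g, b), [2.5, 97.5])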
The authors claim that the methods are free of assumptions, but that
is not so. Their simulation is run at several levels of "assumed
kappa values", which might matter. However, there are several other
favorable conditions met in their example, which they fail to
*mention* as being relevant. The three Diagnoses had rates of
18%, 26%, and 28%,
- which are not *extreme* (for example, a real Dx of under 10% is
extreme); and
- which are not very different.
Further, a condition they did not test, the intercorrelation (that is,
the overlap) of the two alternative test-diagnoses, was only moderate
in their example:
- no extreme overlap.
There are two other logical points that the paper ignores:
- How far can you extrapolate from a small amount of data? and
- Is "better" really a proper conclusion when the results are merely
"different"?
For the 50 cases in their example, the two test diagnoses
(abbreviated dx) agreed on 40 correct judgments: 33 "no dx" and 7
"positive dx"; and they agreed on 3 more instances where both were
mistaken. That leaves seven cases where they differed: Scale G
labeled 6 extra as Cases, of which 4 judgments were correct; Scale B
labeled 1 extra as a Case, and that one was correct.
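Laid out as arithmetic (a sketch using only the counts just quoted;
the direction of the 3 shared mistakes is not given above, so they are
left as a single count):

    # Tallying the 50 cases from the counts above.
    agree_correct_negative = 33   # both scales said "no dx", correctly
    agree_correct_positive = 7    # both scales said "positive dx", correctly
    agree_both_wrong       = 3    # both scales made the same wrong call
    g_only_positive        = 6    # Scale G said Case, Scale B did not (4 were right)
    b_only_positive        = 1    # Scale B said Case, Scale G did not (it was right)

    disagreements = g_only_positive + b_only_positive
    total = (agree_correct_negative + agree_correct_positive
             + agree_both_wrong + disagreements)
    print(total, disagreements)   # 50 cases in all; only 7 carry information about a difference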
This is only a small amount of "information" to process. It could be
a mistake to dump it into a black box and presume that there will be
power to detect important differences.
Further, the superiority of one scale in the example is adjudged, by
their method, as "almost significant". But what is the value of the
extra positive-dx cases? Does capturing 4 extra cases outweigh the 2
extra errors? The question becomes more important when there is a
greater difference in sensitivities. For an "assumption-free" look
(i.e., one that is similarly blind), I can assert that in the 7
disagreements, method G was correct 5 times and method B was correct
twice. If there were only two kinds of errors (that is, if one method
were purely more inclusive), I could do a binomial test; with four
kinds of errors, I think the 5 vs. 2 split could still be compared
that way, but not very confidently.
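To put a number on that last comparison: treating each of the 7
disagreements as a coin flip under a null of no difference, the exact
binomial (sign) test on a 5-to-2 split is far from significant. A
minimal sketch, taking the 5 vs. 2 count at face value and ignoring
the four-kinds-of-error complication:

    from math import comb

    def sign_test_two_sided(k, n):
        # Exact two-sided sign test: the chance of a split at least as
        # lopsided as k out of n when each disagreement is a 50/50 call.
        tail = sum(comb(n, i) for i in range(max(k, n - k), n + 1)) / 2**n
        return min(1.0, 2 * tail)

    print(sign_test_two_sided(5, 7))   # about 0.45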
You have to look at the actual differences to see how much data you
have, and to consider, for yourself, what it might mean. In my own
experience, the two "methods" have usually been almost identical: a
"new method" differs, mainly, in that it identifies more or fewer
cases, but with complete overlap (the positives of one scale nested
inside the positives of the other).
Hope this helps, and wasn't too confusing.
--
Rich Ulrich, [EMAIL PROTECTED]
http://www.pitt.edu/~wpilib/index.html