Rich Ulrich <[EMAIL PROTECTED]> wrote in message news:<[EMAIL PROTECTED]>...
> - posted to sci.stat.edu,
> also to sci.stat.consult where a similar question was asked.
>
> On Wed, 29 May 2002 10:42:14 GMT, Adrian <[EMAIL PROTECTED]> wrote:
>
> > Hello how is everybody,
> >
> > This is my first post here. I am wondering if I'm on the right track. Am
> > doing an experiment in which I now have to work out inter- and
> > intraobserver error. From my reading so far, it seems that this kappa
> > value is the way to go. I need a way to compare the results (in the
> > format of 'severe', 'moderate', 'mild', 'normal' etc...) from THREE
> > raters (not TWO as most
> [ ... ]
>
> No, and no.
>
> You don't want kappa. It is most suited for dichotomies; it is sometimes
> used with multiple categories; it is a poor choice when you have
> well-ordered steps. When you have scores, use some version of correlation.
>
> If you are 'doing an experiment', then you ought to want to know the
> relations between the raters taken as pairs: is *any* one of them
> different? That's my opinion. If you are going to report on the multiple
> scorers, the intraclass correlation is probably the most popular
> statistic; it is computed from an ANOVA table. On the other hand, I have
> long recommended looking at the pairs. That is easily done with the
> inter-class correlation, the ordinary Pearson r (which SPSS provides as
> an ancillary statistic of the paired-t test). Look at the r and the t
> and notice the standard deviations.
>
> In my area, the three-rater intraclass correlation is not needed for much
> except as a concise summary for the final write-up. It can be computed
> from the ANOVA table. You should be able to search and find discussion of
> these at various places, including the SPSS web site. (I don't remember
> how much is in my own stats-FAQ on the subject, but there might be....)
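In case it helps to see Rich's suggestion in concrete terms, here is a rough sketch of the pairwise checks and the ANOVA-based intraclass correlation he describes. It assumes the ordered categories have been coded as numbers (normal=0 through severe=3); the coding, the column layout, and the toy data are mine, not anything from the original posts, so treat it as an illustration rather than a recipe:

# Sketch: pairwise Pearson r / paired t, plus an intraclass correlation
# from the two-way ANOVA decomposition.  Toy data; category coding assumed.
import numpy as np
from itertools import combinations
from scipy import stats

# One row per case, one column per rater; 0=normal, 1=mild, 2=moderate, 3=severe.
ratings = np.array([
    [0, 0, 1],
    [2, 2, 2],
    [3, 2, 3],
    [1, 1, 0],
    [2, 3, 3],
    [0, 1, 1],
])

# 1. The raters taken as pairs: Pearson r for agreement in ordering, and the
#    paired t (with the means/SDs) to see whether one rater runs
#    systematically higher or lower than another.
for i, j in combinations(range(ratings.shape[1]), 2):
    r, _ = stats.pearsonr(ratings[:, i], ratings[:, j])
    t, p = stats.ttest_rel(ratings[:, i], ratings[:, j])
    print(f"rater {i + 1} vs rater {j + 1}: r = {r:.2f}, paired t = {t:.2f} (p = {p:.3f})")

# 2. The intraclass correlation as a one-number summary, computed from the
#    two-way (cases x raters) ANOVA table -- here the single-rater,
#    absolute-agreement form (the ICC(2,1) of Shrout & Fleiss, 1979);
#    other forms exist.
n, k = ratings.shape
grand = ratings.mean()
ss_rows = k * ((ratings.mean(axis=1) - grand) ** 2).sum()   # between cases
ss_cols = n * ((ratings.mean(axis=0) - grand) ** 2).sum()   # between raters
ss_total = ((ratings - grand) ** 2).sum()
ss_err = ss_total - ss_rows - ss_cols
ms_rows = ss_rows / (n - 1)
ms_cols = ss_cols / (k - 1)
ms_err = ss_err / ((n - 1) * (k - 1))
icc = (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)
print(f"ICC(2,1) = {icc:.2f}")

The point of the pairwise step, as Rich says, is to see whether any one rater stands apart; the ICC is then just the concise summary for the write-up.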
I do not know the answer from personal experience, but here's my source for answering your question:

http://ourworld.compuserve.com/homepages/jsuebersax/

Kappa Coefficients

1. Introduction

Though the kappa coefficient was very popular for many years, there has been continued and increasing criticism of its use. At the least, it can be said that (1) kappa should not be viewed as the standard or default way to quantify agreement; (2) one should be concerned about using a statistic that is the source of so much controversy; and (3) one should consider some of the alternatives so as to make an informed decision.

One can distinguish between two possible uses of kappa: (1) to test rater independence (i.e., as a test statistic), and (2) to quantify the level of agreement (i.e., as an effect size). The former involves testing the null hypothesis that there is no more agreement than might occur by chance given random guessing; that is, one makes a qualitative, "yes or no" decision about whether raters are independent or not. Kappa is appropriate for this purpose (although to know that raters are not independent is not very informative; raters are dependent by definition, inasmuch as they are rating the same cases).

It is the second use of kappa--quantifying actual levels of agreement--that is the source of concern. Kappa's calculation uses a term called the proportion of chance (or expected) agreement. This is interpreted as the proportion of times raters would agree by chance alone. However, the term is relevant only under the conditions of statistical independence of raters. Since raters are clearly not independent, the relevance of this term, and its appropriateness as a correction to actual agreement levels, is very questionable.

Thus, the common statement that kappa is a "chance-corrected measure of agreement" is misleading. As a test statistic, kappa can verify that agreement exceeds chance levels. But as a measure of the level of agreement, kappa is not "chance-corrected"; indeed, in the absence of some explicit model of rater decision making, it is by no means clear how chance affects the decisions of actual raters and how one might correct for it.

A better case for using kappa to quantify rater agreement is that, under certain conditions, it approximates the intraclass correlation. But this too is problematic in that (1) these conditions are not always met, and (2) one could instead directly calculate the intraclass correlation.
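To make that "proportion of chance agreement" term concrete, here is a minimal sketch of the two-rater kappa calculation; the 4x4 table of counts is invented. The last few lines show the weighted version for ordered categories, where the choice of weights is exactly the arbitrary decision mentioned in the Cons list below.

# Minimal sketch of Cohen's kappa from a two-rater contingency table.
# Counts are invented; rows = rater A's category, columns = rater B's,
# ordered normal, mild, moderate, severe.
import numpy as np

counts = np.array([
    [17,  3,  1,  0],
    [ 4, 20,  5,  1],
    [ 0,  6, 25,  4],
    [ 0,  1,  3, 10],
])

n = counts.sum()
p_obs = np.trace(counts) / n              # proportion of exact agreement
row_marg = counts.sum(axis=1) / n         # rater A's base rates
col_marg = counts.sum(axis=0) / n         # rater B's base rates
p_exp = (row_marg * col_marg).sum()       # "chance" agreement -- the term that
                                          # assumes the raters are independent
kappa = (p_obs - p_exp) / (1 - p_exp)
print(f"p_o = {p_obs:.2f}, p_e = {p_exp:.2f}, kappa = {kappa:.2f}")

# Weighted kappa for ordered categories: partial credit for near misses.
# Linear agreement weights here; quadratic weights (squaring the distance
# term) would give a different number -- the arbitrariness noted below.
dist = np.abs(np.subtract.outer(np.arange(4), np.arange(4)))
w = 1 - dist / 3
p_obs_w = (w * counts).sum() / n
p_exp_w = (w * np.outer(row_marg, col_marg)).sum()
print(f"weighted kappa = {(p_obs_w - p_exp_w) / (1 - p_exp_w):.2f}")

The p_exp line is where the independence assumption enters; as a significance test that is fine, but as a correction to the observed agreement it is the part the page argues is questionable.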
2. Pros and Cons about Kappa

Pros

* Kappa statistics are easily calculated and software is readily available (e.g., SAS PROC FREQ).
* Kappa statistics are appropriate for testing whether agreement exceeds chance levels for binary and nominal ratings.

Cons

* Kappa is not really a chance-corrected measure of agreement (see above).
* Kappa is an omnibus index of agreement. It does not make distinctions among various types and sources of disagreement.
* Kappa is influenced by trait prevalence (distribution) and base rates. As a result, kappas are seldom comparable across studies, procedures, or populations (Thompson & Walter, 1988; Feinstein & Cicchetti, 1990).
* Kappa may be low even though there are high levels of agreement and even though individual ratings are accurate. Whether a given kappa value implies a good or a bad rating system or diagnostic method depends on what model one assumes about the decision making of raters (Uebersax, 1988).
* With ordered category data, one must select weights arbitrarily to calculate weighted kappa (Maclure & Willett, 1987).
* Kappa requires that two raters/procedures use the same rating categories. There are situations where one is interested in measuring the consistency of ratings for raters that use different categories (e.g., one uses a scale of 1 to 3, another uses a scale of 1 to 5).
* Tables that purport to categorize ranges of kappa as "good," "fair," "poor," etc. are inappropriate; do not use them.

3. Bibliography: Kappa Coefficient ...

Doc
