To follow up a little here, I implemented Krippendorff's Alpha and ran it against all the data we currently have in Discernatron. The distribution looks something like:
constraint              count
alpha >= 0.80           11
0.667 <= alpha < 0.80   18
0.500 <= alpha < 0.667  20
0.333 <= alpha < 0.500  26
0 <= alpha < 0.333      43
alpha < 0               31

This is a much lower level of agreement than I was expecting. The literature suggests 0.80 as a cutoff for reliable data, and 0.667 as a cutoff from which tentative conclusions can be drawn. Below 0 indicates there is less agreement than random chance, and we need to re-evaluate the instructions to make them clearer (probably true).

On Thu, Oct 27, 2016 at 7:51 AM, Erik Bernhardson <[email protected]> wrote:

> Thanks for the links! This is exactly what I was looking for. After
> reviewing some of the options I'm going to do a first try with
> Krippendorff's Alpha. Its ability to handle missing data from some graders,
> as well as being applicable down to n=2, seems promising.
>
> On Oct 26, 2016 11:37 AM, "Justin Ormont" <[email protected]> wrote:
>
>> You're in the area of:
>> https://en.wikipedia.org/wiki/Inter-rater_reliability
>>
>> --justin
>>
>> On Wed, Oct 26, 2016 at 11:31 AM, Jonathan Morgan <[email protected]> wrote:
>>
>>> Disclaimer: I'm not a math nerd, and I don't know the history of
>>> Discernatron very well.
>>>
>>> ...but re: your second specialized concern, have you considered running
>>> some more sophisticated inter-rater reliability statistics to get a
>>> better sense of the degree of disagreement (controlling for random
>>> chance)? See for example:
>>> https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3402032/
>>>
>>> - Jonathan
>>>
>>> On Wed, Oct 26, 2016 at 11:21 AM, Erik Bernhardson <[email protected]> wrote:
>>>
>>>> For a little backstory, in Discernatron multiple judges provide scores
>>>> from 0 to 3 for results. Typically we only request that a single query
>>>> be reviewed by two judges. We would like to measure the level of
>>>> disagreement between these two judges and, if it crosses some
>>>> threshold, get two more scores, so we can then measure disagreement in
>>>> the group of 4. Somehow, though, we need to define how to measure that
>>>> level of disagreement and what the threshold for needing more scores
>>>> is.
>>>>
>>>> Some specialized concerns:
>>>>
>>>> * It is probably important to capture not just that the judges gave
>>>> different values, but also how far apart they are. The difference
>>>> between a 3 and a 2 is much smaller than between a 2 and a 0.
>>>> * If the judges agree that 80% of the results are all 0 but disagree
>>>> on the last 20%, then even though the average disagreement is low it's
>>>> probably still important? It might be worthwhile to remove all the
>>>> agreed-upon irrelevant results before calculating disagreement. Not
>>>> sure...
>>>>
>>>> I know we have a few math nerds here on the list, so I'm hoping
>>>> someone has a few ideas.
>>>
>>> --
>>> Jonathan T. Morgan
>>> Senior Design Researcher
>>> Wikimedia Foundation
>>> User:Jmorgan (WMF) <https://meta.wikimedia.org/wiki/User:Jmorgan_(WMF)>
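For reference, the statistic discussed in this thread can be sketched in a few lines of Python. This is a minimal illustration, not Discernatron's actual implementation: it uses the interval metric delta(c, k) = (c - k)^2, which addresses the concern above that a 3-vs-2 disagreement should count for less than a 2-vs-0 one, and it handles missing ratings by simply skipping units with fewer than two scores.

```python
from itertools import combinations

def krippendorff_alpha_interval(units):
    """Krippendorff's alpha with the interval metric delta(c, k) = (c - k)**2.

    `units` maps a unit id (e.g. a query/result pair) to the list of scores
    it received. Units with fewer than two scores cannot form a pair and are
    ignored, which is how the statistic tolerates missing ratings.
    """
    # Keep only units that can form at least one pair of scores.
    pairable = {u: vals for u, vals in units.items() if len(vals) >= 2}
    n = sum(len(vals) for vals in pairable.values())
    if n <= 1:
        raise ValueError("need at least two pairable scores")

    # Observed disagreement: mean squared difference over ordered pairs of
    # scores within each unit, each unit weighted by 1 / (m_u - 1).
    d_o = 0.0
    for vals in pairable.values():
        m = len(vals)
        d_o += sum(2 * (a - b) ** 2 for a, b in combinations(vals, 2)) / (m - 1)
    d_o /= n

    # Expected disagreement: mean squared difference over all ordered pairs
    # of scores pooled across units, i.e. what chance alone would produce.
    pooled = [v for vals in pairable.values() for v in vals]
    d_e = sum(2 * (a - b) ** 2 for a, b in combinations(pooled, 2)) / (n * (n - 1))

    return 1.0 - d_o / d_e
```

Perfect agreement gives alpha = 1, agreement no better than chance gives alpha around 0, and systematic disagreement goes negative, matching the cutoffs quoted above.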
_______________________________________________
discovery mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/discovery
