To follow up a little here: I implemented Krippendorff's alpha and ran it
against all the data we currently have in Discernatron. The distribution
looks something like:

constraint               count
alpha >= 0.80               11
0.667 <= alpha < 0.80       18
0.500 <= alpha < 0.667      20
0.333 <= alpha < 0.500      26
0 <= alpha < 0.333          43
alpha < 0                   31

This is a much lower level of agreement than I was expecting. The
literature suggests 0.80 as the cutoff for reliable data, and 0.667 as the
cutoff above which you can still draw tentative conclusions. An alpha below
0 indicates less agreement than random chance, which means we need to
re-evaluate the instructions to make them clearer (probably true).
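For anyone curious, the pairwise interval-metric formulation I used looks roughly like this. This is a simplified sketch in Python, not the actual Discernatron code; the function name and the list-of-lists unit structure are just for illustration:

```python
from itertools import permutations

def krippendorff_alpha(units, delta=lambda a, b: (a - b) ** 2):
    """Krippendorff's alpha with an interval difference function.

    units: list of lists; each inner list holds the scores the judges
    gave to one unit (query/result pair). Missing scores are simply
    absent, so inner lists may have different lengths.
    """
    # Only units scored by at least two judges are pairable.
    units = [u for u in units if len(u) >= 2]
    n = sum(len(u) for u in units)  # total number of pairable values
    if n <= 1:
        return None  # not enough data to measure agreement

    # Observed disagreement: average squared difference over all
    # ordered pairs of values within each unit.
    d_o = sum(
        sum(delta(a, b) for a, b in permutations(u, 2)) / (len(u) - 1)
        for u in units
    ) / n

    # Expected disagreement: the same average over all ordered pairs
    # of values pooled across every unit.
    pooled = [v for u in units for v in u]
    d_e = sum(delta(a, b) for a, b in permutations(pooled, 2)) / (n * (n - 1))

    if d_e == 0:
        return 1.0  # no variation anywhere counts as perfect agreement
    return 1.0 - d_o / d_e
```

The interval metric (squared difference) is what makes a 3-vs-2 disagreement count much less than a 2-vs-0 one, and handling variable-length units is what lets it cope with missing graders.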

On Thu, Oct 27, 2016 at 7:51 AM, Erik Bernhardson <
[email protected]> wrote:

> Thanks for the links! This is exactly what I was looking for. After
> reviewing some of the options, I'm going to make a first attempt with
> Krippendorff's alpha. Its ability to handle missing data from some graders,
> as well as being applicable down to n=2, seems promising.
>
> On Oct 26, 2016 11:37 AM, "Justin Ormont" <[email protected]> wrote:
>
>> You're in the area of:
>> https://en.wikipedia.org/wiki/Inter-rater_reliability
>>
>> --justin
>>
>> On Wed, Oct 26, 2016 at 11:31 AM, Jonathan Morgan <[email protected]>
>> wrote:
>>
>>> Disclaimer: I'm not a math nerd, and I don't know the history of
>>> Discernatron very well.
>>>
>>> ...but re: your second specialized concern, have you considered running
>>> some more sophisticated inter-rater reliability statistics to get a better
>>> sense of the degree of disagreement (controlling for random chance)? See
>>> for example: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3402032/
>>>
>>> - Jonathan
>>>
>>> On Wed, Oct 26, 2016 at 11:21 AM, Erik Bernhardson <
>>> [email protected]> wrote:
>>>
>>>> For a little backstory: in Discernatron, multiple judges provide scores
>>>> from 0 to 3 for search results. Typically each query is reviewed by only
>>>> two judges. We would like to measure the level of disagreement between
>>>> those two judges and, if it crosses some threshold, collect two more
>>>> scores, so we can then measure disagreement within the group of four.
>>>> Somehow, though, we need to define how to measure that level of
>>>> disagreement and what the threshold for requesting more scores should be.
>>>>
>>>> Some specialized concerns:
>>>> * It is probably important to include not just that the users gave
>>>> different values, but also how far apart they are. The difference between a
>>>> 3 and a 2 is much smaller than between a 2 and a 0.
>>>> * If the users agree that 80% of the results are all 0 but disagree on
>>>> the last 20%, the average disagreement is low, yet it's probably still
>>>> important? It might be worthwhile to remove all the agreed-upon
>>>> irrelevant results before calculating disagreement? Not sure...
>>>>
>>>> I know we have a few math nerds here on the list, so I'm hoping someone
>>>> has a few ideas.
>>>>
>>>> _______________________________________________
>>>> discovery mailing list
>>>> [email protected]
>>>> https://lists.wikimedia.org/mailman/listinfo/discovery
>>>>
>>>>
>>>
>>>
>>> --
>>> Jonathan T. Morgan
>>> Senior Design Researcher
>>> Wikimedia Foundation
>>> User:Jmorgan (WMF) <https://meta.wikimedia.org/wiki/User:Jmorgan_(WMF)>
>>>
>>>
>>
