For a little backstory: in Discernatron, multiple judges score results
from 0 to 3. Typically we request that each query be reviewed by only
two judges. We would like to measure the level of disagreement between
those two judges and, if it crosses some threshold, request two more
scores, so we can then measure disagreement within the group of four.
Somehow, though, we need to define how to measure that level of
disagreement and what the threshold for needing more scores is.
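To make the flow concrete, here's a minimal sketch of the escalation
logic. The disagreement() metric and THRESHOLD are placeholders — pinning
down what they should actually be is exactly what I'm asking about:

```python
THRESHOLD = 1.0  # hypothetical cutoff, to be tuned


def disagreement(scores):
    """Placeholder metric: mean absolute pairwise distance between
    the judges' scores for a single result (scores are 0-3)."""
    pairs = [(a, b) for i, a in enumerate(scores) for b in scores[i + 1:]]
    return sum(abs(a - b) for a, b in pairs) / len(pairs)


def needs_more_judges(scores):
    """True when the current judges disagree enough that we should
    request two more scores and re-measure within the group of four."""
    return disagreement(scores) > THRESHOLD


# e.g. needs_more_judges([3, 0]) escalates; needs_more_judges([2, 2]) does not
```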

Some specialized concerns:
* It is probably important to account not just for whether the judges
gave different values, but also for how far apart those values are. The
difference between a 3 and a 2 is much smaller than between a 2 and a 0.
* If the judges agree that 80% of the results are 0 but disagree on the
remaining 20%, the average disagreement is low, yet it's probably still
important? It might be worthwhile to remove all the agreements on
irrelevant results before calculating disagreement? Not sure...
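One way both concerns could be folded into a single metric — squaring the
score gap so large disagreements dominate, and optionally dropping
results that both judges marked 0. This is just a sketch, not a
proposal I'm wedded to; the function name and the squared weighting are
my own assumptions:

```python
def weighted_disagreement(scores_a, scores_b, drop_agreed_zeros=True):
    """Average squared score distance between two judges over a query's
    results. Squaring the gap makes 2-vs-0 (weight 4) count far more
    than 3-vs-2 (weight 1). With drop_agreed_zeros=True, results both
    judges scored 0 are excluded so that agreeing a result is irrelevant
    can't dilute real disagreement elsewhere."""
    pairs = list(zip(scores_a, scores_b))
    if drop_agreed_zeros:
        pairs = [(a, b) for a, b in pairs if not (a == 0 and b == 0)]
    if not pairs:
        return 0.0  # judges agreed everything was irrelevant
    return sum((a - b) ** 2 for a, b in pairs) / len(pairs)
```

On the 80/20 example above: with five results where both judges give 0
on four and split 3-vs-1 on the last, averaging over all five results
dilutes the disagreement to 0.8, while dropping the agreed zeros leaves
the full 4.0 for the one contested result.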

I know we have a few math nerds here on the list, so hoping someone has a
few ideas.
_______________________________________________
discovery mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/discovery