Disclaimer: I'm not a math nerd, and I don't know the history of Discernatron very well.
...but re: your second specialized concern, have you considered running some more sophisticated inter-rater reliability statistics to get a better sense of the degree of disagreement (controlling for random chance)? See for example: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3402032/

- Jonathan

On Wed, Oct 26, 2016 at 11:21 AM, Erik Bernhardson <[email protected]> wrote:

> For a little backstory, in discernatron multiple judges provide scores
> from 0 to 3 for results. Typically we only request a single query to be
> reviewed by two judges. We would like to measure the level of disagreement
> between these two judges, and if it crosses some threshold get two more
> scores, so we can then measure disagreement in the group of 4. Somehow
> though, we need to define how to measure that level of disagreement and
> what the threshold for needing more scores is.
>
> Some specialized concerns:
> * It is probably important to include not just that the users gave
> different values, but also how far apart they are. The difference between a
> 3 and a 2 is much smaller than between a 2 and a 0.
> * If the users agree that 80% of the results are all 0, but disagree on
> the last 20%, even though the average disagreement is low it's probably
> still important? Might be worthwhile to take all the agreements about
> irrelevant results and remove them before calculating disagreement? Not
> sure...
>
> I know we have a few math nerds here on the list, so hoping someone has a
> few ideas.
>
> _______________________________________________
> discovery mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/discovery
>
>

--
Jonathan T. Morgan
Senior Design Researcher
Wikimedia Foundation
User:Jmorgan (WMF) <https://meta.wikimedia.org/wiki/User:Jmorgan_(WMF)>
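
P.S. To make the suggestion a bit more concrete, here is a minimal sketch of one statistic in this family, a weighted Cohen's kappa for two judges scoring on the 0 to 3 scale. It is only an illustration, not something Discernatron does today: the function name `weighted_kappa`, its parameters, and the sample scores are all made up for this example. The weights penalize a 2-vs-0 disagreement more than a 3-vs-2 one (the first concern), and the chance correction keeps a large block of agreed-upon 0s from hiding disagreement on the remaining results (the second concern).

```python
def weighted_kappa(scores_a, scores_b, n_categories=4, power=2):
    """Weighted Cohen's kappa for two judges (illustrative sketch).

    scores_a, scores_b: equal-length lists of integer scores in
    range(n_categories). power=1 gives linear weights, power=2 quadratic.
    Returns 1.0 for perfect agreement, about 0 for chance-level agreement,
    and negative values for worse-than-chance agreement.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")

    n = len(scores_a)
    k = n_categories

    # Observed joint distribution of (judge A score, judge B score).
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(scores_a, scores_b):
        observed[a][b] += 1.0 / n

    # Marginal distribution of each judge's scores.
    marg_a = [sum(observed[i]) for i in range(k)]
    marg_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    def weight(i, j):
        # Disagreement weight: 0 on the diagonal, 1 for a 0-vs-3 split.
        return (abs(i - j) / (k - 1)) ** power

    # Weighted disagreement actually observed, and expected by chance
    # if the two judges scored independently with the same marginals.
    d_obs = sum(weight(i, j) * observed[i][j]
                for i in range(k) for j in range(k))
    d_exp = sum(weight(i, j) * marg_a[i] * marg_b[j]
                for i in range(k) for j in range(k))

    if d_exp == 0:
        # Both judges gave the same single score to every result.
        return 1.0
    return 1.0 - d_obs / d_exp


# Hypothetical query: the judges agree that 8 of 10 results are 0,
# then disagree on the last two.
judge_a = [0, 0, 0, 0, 0, 0, 0, 0, 3, 2]
judge_b = [0, 0, 0, 0, 0, 0, 0, 0, 0, 3]
print(weighted_kappa(judge_a, judge_b))
```

In this made-up query the raw agreement is 80%, but kappa comes out noticeably lower because most of that agreement is what you would expect from two judges who both score almost everything 0. Where to put the cutoff for requesting two more scores is a separate judgment call that would need real data. Cohen's kappa is also only defined for two raters, so the group-of-4 case would need a different statistic, for example Krippendorff's alpha.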
