Hello everyone,

I am a student in the Human-Centered Computing research group at Freie Universität Berlin in Germany, and I am working on a project to semantically enrich ideas and other short texts.
To evaluate how well people annotate concepts in texts with our software, we want to compare their annotations against a gold standard of expected annotations. While working on this, I noticed the following: with Precision and Recall the decision is always binary (wrong/right), but sometimes a concept is only "kind of right" (see the example below). Could you recommend approaches or algorithms that deal with this?

What I am trying to achieve in detail: I have a sentence like

"This pet food distribution center is open now."

and I also have user-generated annotations for it. I want to treat my own annotations as the gold standard (GS) and compare the user-generated annotations against that GS. With plain Precision/Recall/F-measure I run into problems, because the concepts I would consider 'best' are

- http://dbpedia.org/resource/Pet_food
- http://dbpedia.org/resource/Food_distribution
- http://dbpedia.org/resource/Distribution_center

but that would mean that, for example, 'Pet' and 'Food' are scored as being just as wrong as 'Car' and 'Closet'. I could annotate redundantly, but that would incentivize users to over-annotate.

Best regards
Maximilian Stauss
HCC | FU Berlin
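P.S. To make the problem concrete, here is a minimal sketch (Python) of the strict scoring I use now and of the kind of partial-credit scoring I am looking for. The similarity() function is only a placeholder of my own; how to fill it in sensibly (taxonomy distance, label similarity, etc.) is exactly my question.

    # Strict vs. partial-credit precision/recall over DBpedia concept sets.
    # GOLD are my "best" annotations, USER is a hypothetical user result.

    GOLD = {
        "http://dbpedia.org/resource/Pet_food",
        "http://dbpedia.org/resource/Food_distribution",
        "http://dbpedia.org/resource/Distribution_center",
    }

    USER = {
        "http://dbpedia.org/resource/Pet",
        "http://dbpedia.org/resource/Food",
        "http://dbpedia.org/resource/Distribution_center",
    }


    def strict_prf(gold, user):
        """Exact-match P/R/F1: 'Pet' is scored exactly as wrong as 'Car'."""
        tp = len(gold & user)
        precision = tp / len(user) if user else 0.0
        recall = tp / len(gold) if gold else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        return precision, recall, f1


    def similarity(a, b):
        """Placeholder concept similarity in [0, 1] -- the open part."""
        return 1.0 if a == b else 0.0


    def soft_prf(gold, user):
        """Partial-credit variant: each user annotation is credited with its
        best similarity to any gold concept, and vice versa for recall."""
        precision = (sum(max(similarity(u, g) for g in gold) for u in user)
                     / len(user)) if user else 0.0
        recall = (sum(max(similarity(g, u) for u in user) for g in gold)
                  / len(gold)) if gold else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        return precision, recall, f1


    print(strict_prf(GOLD, USER))  # (0.333..., 0.333..., 0.333...)
    print(soft_prf(GOLD, USER))    # same until similarity() gives partial credit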