Nadav Har'El wrote:
> Another approach is to use Term Relevance Sets, described in [1].
> This new approach not only requires less manual labor than TREC's approach,
> but also works better when the corpus is evolving.
>
> [1] "Scaling IR-System Evaluation using Term Relevance Sets",
> Einat Amitay, David Carmel, Ronny Lempel and Aya Soffer, SIGIR 2004,
> http://einat.webir.org/SIGIR_2004_Trels_p10-amitay.pdf
This is an interesting approach. The gist, more or less: creating TREC's QRels is based on human labor, judging documents of the searched collection against the "topics" (queries). Since there are too many documents (and in some tracks too many queries), sampling techniques are used and only some documents/queries are manually judged. In the proposed TRels approach, the original query is decorated with lists of positive and negative terms. Returned documents are then evaluated against these term lists: each positive term found in a returned document increases its "relevant match probability", and each negative term found reduces it.

In general I like this idea, and the paper shows good correlation between TRel judgments and QRel judgments. But I have a problem with the judging procedure relying on the same techniques it is evaluating, namely checking term occurrences. Also, one could devise queries that are in effect made of the "plus terms" and "minus terms"... I still find the "magical" attempt to have an automated system find what a human would have selected, had they been able to read all the documents, more appealing.

Anyhow, my take from this, at least, is that the quality-checking mechanism we want to add to the benchmark should be general/open enough to allow this notion of quality assessment, or any other notion.
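To make "general/open enough" concrete, here is a rough sketch in Java (since we are talking about the Lucene benchmark) of what a pluggable judging interface might look like, with a TRels-style implementation behind it. The names RelevanceJudge and TrelsJudge, the per-term weights, and the threshold are all made up for illustration; they are not taken from the paper or from any existing benchmark code.

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    /** Any notion of "is this document relevant to this query?" plugs in here. */
    interface RelevanceJudge {
        boolean isRelevant(String query, String docText);
    }

    /**
     * TRels-style judge: the query is decorated with positive and negative
     * terms; positive terms found in a returned document raise its estimated
     * relevance, negative terms lower it, and a threshold decides the verdict.
     */
    class TrelsJudge implements RelevanceJudge {
        private final Set<String> positive;
        private final Set<String> negative;
        private final double threshold;

        TrelsJudge(Set<String> positive, Set<String> negative, double threshold) {
            this.positive = positive;
            this.negative = negative;
            this.threshold = threshold;
        }

        @Override
        public boolean isRelevant(String query, String docText) {
            double score = 0.0;
            for (String token : docText.toLowerCase().split("\\W+")) {
                if (positive.contains(token)) score += 1.0;  // positive term: more likely relevant
                if (negative.contains(token)) score -= 1.0;  // negative term: less likely relevant
            }
            return score >= threshold;
        }

        public static void main(String[] args) {
            RelevanceJudge judge = new TrelsJudge(
                new HashSet<>(Arrays.asList("lucene", "index", "search")),
                new HashSet<>(Arrays.asList("kayak", "recipe")),
                2.0);
            System.out.println(judge.isRelevant("lucene indexing",
                "How to index and search documents with Lucene"));  // prints true
        }
    }

A QRel-based judge (look up the document id in a human-judged list) would implement the same interface, so the benchmark itself would not need to know which notion of quality is in use.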