Nadav Har'El wrote:

> Another approach is to use Term Relevance Sets, described in [1].
> This new approach not only requires less manual labor than
> TREC's approach,
> but also works better when the corpus is evolving.
>
> [1] "Scaling IR-System Evaluation using Term Relevance Sets",
> Einat Amitay,
> David Carmel, Ronny Lempel and Aya Soffer, SIGIR 2004,
> http://einat.webir.org/SIGIR_2004_Trels_p10-amitay.pdf

This is an interesting approach. It says, more or less: creating TREC's
QRels relies on human labor, judging documents of the searched collection
against the "topics" (queries). Since there are too many docs (and in some
tracks too many queries), sampling techniques are used and only some
docs/queries are manually judged. In the proposed TRel approach, the
(original) query is decorated with lists of positive and negative terms.
Returned docs are then evaluated against these term lists: the presence
of positive terms in a returned doc increases its "relevant match
probability", and the presence of a negative term reduces it.

In general I like this idea, and the paper shows correlation between
TRel judgments and QRel judgments. But I have a problem with the judging
procedure using the same techniques that are being evaluated - checking
term frequencies, that is. Also, one could devise queries that are built
directly from the "plus terms" and "minus terms"... I find the "magical"
attempt to have your automated system find what a human would have
selected, had that human been able to read all the docs, more appealing,
and more correct...

Anyhow, my take from this, at least, is that the quality-checking
mechanism that we want to add to the benchmark should be general/open
enough to allow this notion of quality assessment, or any other notion.
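
One way to keep it open, as a sketch only (the interface name and
signature below are hypothetical, not an existing benchmark API), would
be a pluggable judge that the benchmark calls per returned doc:

  /**
   * Hypothetical plug-in point: the benchmark only asks "how relevant is
   * this returned doc for this query?", so a QRel-based judge, a
   * TRel-based judge, or any other notion can sit behind it.
   */
  public interface QualityJudge {
      /** Returns a relevance estimate in [0,1] for one returned doc. */
      double judge(String queryId, String docName, String docText);
  }

  /** Example plug-in wrapping the TRel-style scoring sketched above. */
  class TrelJudge implements QualityJudge {
      private final java.util.Map<String, java.util.List<String>> plus;
      private final java.util.Map<String, java.util.List<String>> minus;

      TrelJudge(java.util.Map<String, java.util.List<String>> plus,
                java.util.Map<String, java.util.List<String>> minus) {
          this.plus = plus;
          this.minus = minus;
      }

      @Override
      public double judge(String queryId, String docName, String docText) {
          return TrelSketch.trelScore(docText,
                  plus.getOrDefault(queryId, java.util.List.of()),
                  minus.getOrDefault(queryId, java.util.List.of()));
      }
  }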

