Addressing most of the recent discussion below...
On 16/9/2010 4:24 AM, Dan Cardin wrote:
1. Should the Open Relevance Viewer be capable of importing text and
images?
Corpora, IMO, should be text only and index-ready (i.e. no special
parsing required). This is what I assumed in Orev as well (see below).
Is the objective of the Open Relevance Viewer to provide a crowd-sourcing
tool that can have its data annotated and then to use the annotated data for
determining the performance of machine learning techniques/algorithms? Or
is it to provide a generic crowd-sourcing tool for academics, government, and
industry to annotate data with? Or am I missing the point?
This tool should be, as Grant and Mark mentioned, engine agnostic. It
should provide those interested with tools to be able to judge
effectiveness of different engines, and also different methods with the
same engine.
Hence, the most basic implementation should be able to handle many corpora
and topics for more than one (natural) language, and the crowd-sourcing
portion of it is where a user can create judgments - e.g. view a
document from a corpus side by side with a topic, and mark "Relevant",
"Non-relevant" (or "Skip this").
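As a rough illustration of the data involved, here is a minimal Python
sketch of corpora documents, topics and judgments. All class and field
names are hypothetical and are not taken from Orev's actual schema; it
only assumes what is described above: plain-text documents, topics in a
given language, and a three-way verdict per (topic, document) pair.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    """The three choices offered to a user in the judging view."""
    RELEVANT = "relevant"
    NON_RELEVANT = "non-relevant"
    SKIPPED = "skipped"


@dataclass(frozen=True)
class Document:
    corpus_id: str
    doc_id: str
    language: str  # natural language of the corpus, e.g. "en"
    text: str      # plain, index-ready text (no special parsing needed)


@dataclass(frozen=True)
class Topic:
    topic_id: str
    language: str
    title: str
    description: str = ""


@dataclass(frozen=True)
class Judgment:
    topic_id: str
    corpus_id: str
    doc_id: str
    user_id: str
    verdict: Verdict


def judge(topic: Topic, doc: Document, user_id: str, verdict: Verdict) -> Judgment:
    """Record one user's verdict for a (topic, document) pair."""
    return Judgment(topic.topic_id, doc.corpus_id, doc.doc_id, user_id, verdict)
```

Keeping the user on each judgment means several people can judge the
same pair, and disagreements can be resolved later.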
After several hundred human-hours, this basic implementation will
result in packages containing corpora, topics and judgments for several
languages. This can then be used as the basis for more sophisticated parts
of the project, where relevance ranking of actual query results,
TREC-like testing, MAP/MRR and user behavior tracking are just examples.
In other words, IMHO Grant's view is a bit too far-reaching for this stage,
where there's still a lot of fundamental work to do.
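For reference, this is roughly what the MAP/MRR part would compute once
judgments exist. It is a sketch using the textbook definitions, not code
from any existing tool; the function names are made up, and it assumes
binary relevance and ranked lists without duplicate document ids.

```python
from typing import Callable, Dict, List, Set


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Mean of precision@k over the ranks k where a relevant doc appears."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    # Relevant docs that were never retrieved contribute zero.
    return total / len(relevant)


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant doc, or 0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_over_topics(
    metric: Callable[[List[str], Set[str]], float],
    runs: Dict[str, List[str]],
    qrels: Dict[str, Set[str]],
) -> float:
    """Average a per-topic metric over all topics in a run (MAP or MRR)."""
    if not runs:
        return 0.0
    scores = [metric(ranked, qrels.get(topic, set())) for topic, ranked in runs.items()]
    return sum(scores) / len(scores)
```

MAP is then mean_over_topics(average_precision, runs, qrels) and MRR is
mean_over_topics(reciprocal_rank, runs, qrels), where runs maps each
topic id to an engine's ranked document ids.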
Robert, from the discussion we had a while ago I gathered you are
thinking the same?
Once such data exists in a central system, importing corpora and topics,
and exporting them back with judgments in various formats (TREC, CLEF,
FIRE) can be done fairly easily. We just need to make sure the system
stores all data correctly.
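To show how small the export step can be, here is a sketch that writes
judgments in the four-column qrels layout commonly used with TREC
tooling (topic id, iteration, document id, relevance). The function name
and the input tuple shape are assumptions for illustration; skipped
judgments are assumed to have been filtered out beforehand.

```python
from typing import Iterable, Tuple


def to_trec_qrels(judgments: Iterable[Tuple[str, str, bool]]) -> str:
    """Render (topic_id, doc_id, is_relevant) tuples as qrels lines.

    Each line is "topic_id 0 doc_id relevance"; the second column
    (iteration) is conventionally 0.
    """
    lines = []
    # Sort so the output is stable regardless of storage order.
    for topic_id, doc_id, is_relevant in sorted(judgments):
        lines.append(f"{topic_id} 0 {doc_id} {1 if is_relevant else 0}")
    return "\n".join(lines) + ("\n" if lines else "")
```

Other export formats would be separate functions over the same stored
judgments, which is why the storage side does not need to follow any of
them.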
Sorry for bringing this up again, but I think I pretty much did most of
that work already, so no need for redundant efforts. In Orev I have
already spec'd and implemented all of the above. What is missing is a
better GUI and user management. I suggest you have a look at it and at
its DB schema: http://github.com/synhershko/Orev/blob/master/Orev.png
How are annotations used for judgments obtained? Separate file, specified by the
user?
If a tool like Orev is used, then this data can be pulled directly
from its DB by the actual test tools (if separate).
Can you provide me with a direct link to the TREC format?
http://trec.nist.gov/pubs/trec1/papers/01.txt
But if we are not going to base data storage on the FS, there's no need
to stick to a particular format; it only matters when exporting judgments...
Itamar