On May 11, 2009, at 4:01 PM, Michael McCandless wrote:
I'd love to see a resource like this (it's high time!), and I'll try to help when/where I can, starting with some initial comments/questions:

I think it's actually quite a challenge to do well. E.g., it's easy to make a corpus that's too easy because it's highly diverse (and thus most search engines have no trouble pulling back relevant results). Instead, I think the content set should be tightly scoped to a particular topic, and not necessarily that large (i.e., we don't need a huge number of documents). It would help if that scoping is towards content that many people find "of interest", so we get "accurate" judgements from as wide an audience as possible.
I think we will want a generic one and then focused ones, but we should start with the generic one first.
E.g., how about coverage of the 2009 H1N1 outbreak (that's licensed appropriately)? Or... the 2008 US presidential election? Or... research on leukemia (but I fear such content is not typically licensed appropriately, nor will it have wide interest)? What does "using Nutch to crawl Creative Commons" actually mean? Can I browse the content that's being crawled?
Nutch has a CC plugin that allows it to filter out non-CC content, AIUI.
Also, to help us build up the relevance judgements, I think we should build a basic custom app for collecting queries as well as annotating them. I should be able to go to that page and run my own queries, which are collected. Then I should be able to browse previously collected queries, click on them, and add my own judgement. The site should try to offer up queries that are "in need" of judgements. It should run the search and let me step through the results, marking those that are relevant; but that would bias the judgements toward that search engine's ranking. Maybe under the hood we rotate through search engines each time?

Do we have anyone involved who's built similar corpora before? Or has anyone read papers on how prior corpora were designed/created?
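To make the rotation idea concrete, something roughly like this is what I have in mind (just a sketch in Java; SearchEngine and JudgmentCollector are made-up names, not existing code):

import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Made-up interface: whatever we wrap around Lucene, Nutch, etc.
interface SearchEngine {
  List<String> search(String query, int topN);   // returns doc ids or URLs
}

class JudgmentCollector {
  private final List<SearchEngine> engines;
  private final AtomicInteger next = new AtomicInteger();

  JudgmentCollector(List<SearchEngine> engines) {
    this.engines = engines;
  }

  // Round-robin over the engines so no single ranking biases the judged pool.
  List<String> resultsToJudge(String query, int topN) {
    SearchEngine engine = engines.get(Math.floorMod(next.getAndIncrement(), engines.size()));
    return engine.search(query, topN);
  }
}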
This is all good, but here I'm thinking simpler, at least at first. I don't know that we need to be writing apps, although feel free, since it is O/S after all. :-)

I was wondering if we couldn't handle this wiki-style (how is still not clear), whereby we simply have pages that contain the queries and judgments, and over time the wisdom of the crowds will work to maintain standards, fill in gaps, etc. Maybe, with regard to judgments, we allow people to vote on them, which over time will converge on an appropriate result (though it is subject to early noise). Not sure what all that means just yet, but the wiki approach lets us get going with minimal resources while still delivering value. Hmm, now it's starting to sound like an app... ;-)
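Just to sketch the voting part (again, made-up names, nothing that exists yet): each (query, doc) pair accumulates up/down votes, and we don't treat a judgment as settled until enough people have weighed in, which is one way to deal with the early noise.

import java.util.ArrayList;
import java.util.List;

class JudgmentVotes {
  private final List<Integer> votes = new ArrayList<Integer>();  // +1 relevant, -1 not

  void addVote(boolean relevant) {
    votes.add(relevant ? 1 : -1);
  }

  // TRUE/FALSE once at least minVotes are in, null while still open,
  // so thinly-voted judgments don't get treated as settled.
  Boolean consensus(int minVotes) {
    if (votes.size() < minVotes) {
      return null;
    }
    int sum = 0;
    for (int v : votes) {
      sum += v;
    }
    return sum > 0;
  }
}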
As opposed to the TREC-style approach, I don't think we need the top 1000 (although it could work); just the top ten or twenty. Sometimes it can even be useful to rate a whole page of results at once, even at the cost of granularity. Basically, what I'm proposing is that we carry out a pragmatic relevance test out in the open, just as people should do in house. I think this fits Lucene's model of operation quite well: be practical by focusing on real data and real feedback rather than obsessing over theory. (Not that you were suggesting otherwise; I'm just stating it.)
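And once we have judgments on just the top ten or twenty, reporting something like precision at 10 per query is trivial. A quick sketch (hypothetical names again, nothing in Lucene today):

import java.util.List;
import java.util.Set;

class Metrics {
  // Precision at k: fraction of the top k ranked results judged relevant.
  static double precisionAtK(List<String> ranked, Set<String> relevantDocs, int k) {
    int hits = 0;
    for (int i = 0; i < Math.min(k, ranked.size()); i++) {
      if (relevantDocs.contains(ranked.get(i))) {
        hits++;
      }
    }
    return hits / (double) k;
  }
}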
I need to find the reference, but I recall the last edition of SIGIR having a discussion on crowdsourcing relevance judgments.
-Grant