On May 13, 2009, at 2:48 PM, Ted Dunning wrote:
> Crawling a reference dataset requires essentially one-time bandwidth.
True, although we will likely evolve over time to have multiple
datasets. No reason to get ahead of ourselves, though.
> Also, it is possible to download, say, Wikipedia in a single go.
Wikipedia isn't always that interesting from a relevance testing
standpoint, at least for IR (it is more useful for QA, machine
learning, etc.). A lot of queries simply have only one or two relevant
results. While that is useful, it is not often the whole picture of
what one needs for IR.
> Likewise there are various web-crawls that are available for research
> purposes (I think). See http://webascorpus.org/ for one example.
> These would be single downloads. I don't entirely see the point of
> redoing the spidering.
I think we have to be able to control the spidering so that we can
say we've vetted what's in it, due to copyright, etc. But maybe not.
I've talked with quite a few people who have corpora available, and it
always comes down to copyright for redistribution in a public way. No
one wants to assume the risk, even though they all crawl and
redistribute (for money).
For instance, the Internet Archive even goes so far as to apply
robots.txt retroactively: if a site later disallows crawling,
previously archived pages are hidden as well. We probably could do the
same thing, but I'm not sure it is necessary.
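To make that concrete, here is a minimal sketch (not committed code, just an
illustration) of what retroactive robots.txt filtering could look like:
re-check every already-crawled URL against the host's current robots.txt and
drop anything that is now disallowed. The user-agent string and sample URLs
are placeholders.

    from urllib.parse import urlsplit
    from urllib.robotparser import RobotFileParser

    USER_AGENT = "OpenRelevanceBot"  # placeholder crawler name

    def allowed_urls(crawled_urls):
        """Yield only URLs that each host's current robots.txt still permits."""
        parsers = {}  # cache one RobotFileParser per host
        for url in crawled_urls:
            parts = urlsplit(url)
            host = parts.scheme + "://" + parts.netloc
            if host not in parsers:
                rp = RobotFileParser()
                rp.set_url(host + "/robots.txt")
                try:
                    rp.read()   # fetch and parse the live robots.txt
                except OSError:
                    rp = None   # host unreachable: err on the side of dropping
                parsers[host] = rp
            rp = parsers[host]
            if rp is not None and rp.can_fetch(USER_AGENT, url):
                yield url

    if __name__ == "__main__":
        sample = ["https://example.org/page1", "https://example.org/private/x"]
        for u in allowed_urls(sample):
            print("keep:", u)

Whether we would want to drop or merely hide such documents is a policy
question, but the filtering step itself is cheap to run over a stored crawl.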