On May 13, 2009, at 2:48 PM, Ted Dunning wrote:

> Crawling a reference dataset requires essentially one-time bandwidth.


True, but we will likely evolve over time to have multiple datasets. No reason to get ahead of ourselves, though.


> Also, it is possible to download, say, wikipedia in a single go.

Wikipedia isn't always that interesting from a relevance-testing standpoint, at least for IR (for QA, machine learning, etc., it is more useful). A lot of queries simply have only one or two relevant results. That is useful, but it is often not the whole picture of what one needs for IR.

> Likewise there are various web-crawls that are available for research purposes (I think). See http://webascorpus.org/ for one example. These would be single downloads.

> I don't entirely see the point of redoing the spidering.

I think we have to be able to control the spidering so that we can say we've vetted what's in it, for copyright reasons, etc. But maybe not. I've talked with quite a few people who have corpora available, and it always comes down to copyright for public redistribution. No one wants to assume the risk, even though they all crawl and redistribute (for money).

For instance, the Internet Archive even goes so far as to apply robots.txt retroactively. We probably could do the same thing, but I'm not sure if it is necessary.
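Just to make the idea concrete, here is a rough Python sketch of what retroactive robots.txt filtering could look like on our side: take URLs we've already crawled and drop anything the site's current robots.txt disallows. The user agent string and sample URL are placeholders, and this isn't claiming to be how the Internet Archive does it; it's only meant to show the approach.

# Minimal sketch: retroactively filter already-crawled URLs against each
# site's current robots.txt. Assumes Python 3 stdlib only.
from urllib.parse import urlparse, urlunparse
from urllib import robotparser

USER_AGENT = "open-relevance-crawler"  # hypothetical agent name

def filter_by_robots(urls):
    """Return only the URLs that each host's current robots.txt still permits."""
    parsers = {}  # cache one RobotFileParser per host
    allowed = []
    for url in urls:
        parts = urlparse(url)
        host = parts.netloc
        if host not in parsers:
            rp = robotparser.RobotFileParser()
            rp.set_url(urlunparse((parts.scheme, host, "/robots.txt", "", "", "")))
            try:
                rp.read()
            except OSError:
                rp = None  # robots.txt unreachable; be conservative and drop the host
            parsers[host] = rp
        rp = parsers[host]
        if rp is not None and rp.can_fetch(USER_AGENT, url):
            allowed.append(url)
    return allowed

if __name__ == "__main__":
    sample = ["https://example.com/some/page.html"]  # placeholder crawled URL
    print(filter_by_robots(sample))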
