On May 13, 2009, at 2:48 PM, Ted Dunning wrote:
> Crawling a reference dataset requires essentially one-time bandwidth.
True, although we will likely evolve over time to have multiple
datasets. No reason to get ahead of ourselves, though.
> Also, it is possible to download, say, Wikipedia in a single go.
Wikipedia isn't always that interesting from a relevance testing
standpoint, at least for IR (it is more useful for QA, machine
learning, etc.). A lot of queries simply have only one or two relevant
results. While that is useful, it is not often the whole picture of
what one needs for IR.
> Likewise there are various web-crawls that are available for research
> purposes (I think). See http://webascorpus.org/ for one example.
> These would be single downloads. I don't entirely see the point of
> redoing the spidering.
I think we have to be able to control the spidering so that we can
say we've vetted what's in it, due to copyright, etc. But maybe not.
I've talked with quite a few people who have corpora available, and it
always comes down to copyright for redistribution in a public way. No
one wants to assume the risk, even though they all crawl and
redistribute (for money).
For instance, the Internet Archive even goes so far as to apply
robots.txt retroactively: if a site later disallows crawling,
previously archived pages are hidden as well. We probably could do the
same thing, but I'm not sure it is necessary.
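To make that concrete, here is a minimal sketch (not committed code, just an
illustration) of what retroactive robots.txt filtering could look like:
re-check every already-crawled URL against the host's current robots.txt and
drop anything that is now disallowed. The user-agent string and sample URLs
are placeholders.

    from urllib.parse import urlsplit
    from urllib.robotparser import RobotFileParser

    USER_AGENT = "OpenRelevanceBot"  # placeholder crawler name

    def allowed_urls(crawled_urls):
        """Yield only URLs that each host's current robots.txt still permits."""
        parsers = {}  # cache one RobotFileParser per host
        for url in crawled_urls:
            parts = urlsplit(url)
            host = parts.scheme + "://" + parts.netloc
            if host not in parsers:
                rp = RobotFileParser()
                rp.set_url(host + "/robots.txt")
                try:
                    rp.read()   # fetch and parse the live robots.txt
                except OSError:
                    rp = None   # host unreachable: err on the side of dropping
                parsers[host] = rp
            rp = parsers[host]
            if rp is not None and rp.can_fetch(USER_AGENT, url):
                yield url

    if __name__ == "__main__":
        sample = ["https://example.org/page1", "https://example.org/private/x"]
        for u in allowed_urls(sample):
            print("keep:", u)

Whether we would want to drop or merely hide such documents is a policy
question, but the filtering step itself is cheap to run over a stored crawl.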