Not sure if this was mentioned before, but... hm, I was going to point out http://index.isc.org/ (see http://ioiblog.wordpress.com/2008/11/07/kicking-off-the-ioi-blog/), but the server doesn't seem to be listening... aha, here: http://ioiblog.wordpress.com/2009/02/
Perhaps we can get data from Dennis and Jeremie?

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: Ted Dunning <ted.dunn...@gmail.com>
> To: general@lucene.apache.org
> Sent: Wednesday, May 13, 2009 2:48:43 PM
> Subject: Re: Open Relevance Project?
>
> Crawling a reference dataset requires essentially one-time bandwidth.
>
> Also, it is possible to download, say, wikipedia in a single go. Likewise
> there are various web-crawls that are available for research purposes (I
> think). See http://webascorpus.org/ for one example. These would be single
> downloads.
>
> I don't entirely see the point of redoing the spidering.
>
> On Wed, May 13, 2009 at 10:56 AM, Grant Ingersoll wrote:
>
> > Good point, although you never know. We also will have some bandwidth reqs
> > for crawling.
>
> --
> Ted Dunning, CTO
> DeepDyve