I think Nutch has to solve this problem itself: if you leave it to the
websites, they're more likely to cut you off than to implement their own
index storage scheme. Besides, they'd get it wrong, serve stale data, etc.

Maybe what's needed is some brainstorming on a shared crawling scheme
implemented in Nutch, perhaps something based on a BitTorrent-like protocol?
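To make that concrete enough to argue about: peers could announce which
fetched segments they hold and serve them to each other, torrent-style, so
that N Nutch installations hit a site once instead of N times. A very rough
interface sketch, purely hypothetical (nothing like this exists in Nutch
today):

import java.util.List;

/**
 * Hypothetical peer-to-peer exchange of crawl results, so that many
 * Nutch installations fetch each site once instead of once each.
 */
public interface SharedCrawlPeer {

    /** Which fetched segments (e.g. host + fetch date) does this peer hold? */
    List<String> listSegments();

    /** Pull a segment from the peer instead of re-crawling the site. */
    byte[] getSegment(String segmentId);

    /** Announce a freshly fetched segment so other peers can skip it. */
    void announceSegment(String segmentId);
}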

incrediBILL seems to have a pretty good point.
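The handshake Michi describes below wouldn't need much, either; it could be
a well-known descriptor URL next to robots.txt that the bot probes before
crawling. A minimal sketch, assuming a hypothetical /site-index.info
convention (no such standard exists that I know of):

import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Date;

/** Hypothetical probe for a site-provided index before crawling. */
public class SiteIndexProbe {

    public static void main(String[] args) throws Exception {
        // Assumed convention: a descriptor next to robots.txt telling
        // bots where a downloadable index lives and how fresh it is.
        URL descriptor = new URL("http://example.com/site-index.info");
        HttpURLConnection conn = (HttpURLConnection) descriptor.openConnection();
        conn.setRequestMethod("HEAD");

        if (conn.getResponseCode() == HttpURLConnection.HTTP_OK) {
            // Last-Modified tells the bot how stale the index is; it can
            // then decide whether downloading it beats a fresh crawl.
            System.out.println("Index offered, dated "
                    + new Date(conn.getLastModified()));
            // ... download the index instead of crawling ...
        } else {
            System.out.println("No index advertised; crawl as usual.");
            // ... fall back to a normal crawl ...
        }
    }
}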

-----Original Message-----
From: Michael Wechner [mailto:[EMAIL PROTECTED] 
Sent: Thursday, June 15, 2006 12:30 AM
To: nutch-dev@lucene.apache.org
Subject: Re: IncrediBILL's Random Rants: How Much Nutch is TOO MUCH Nutch?

Doug Cutting wrote:
> http://incredibill.blogspot.com/2006/06/how-much-nutch-is-too-much-nutch.html
well, I think incrediBILL has a point: people might really start excluding
bots from their servers if it becomes too much. What might help is if
incrediBILL offered an index of the site, which should be smaller than the
site itself. I am not sure whether a "standard" for something like this
exists. Basically, the bot would ask the server whether an index exists,
where it is located, and what date it is from, and then decide either to
download the index or to start crawling the site.

Michi

-- 
Michael Wechner
Wyona      -   Open Source Content Management   -    Apache Lenya
http://www.wyona.com                      http://lenya.apache.org
[EMAIL PROTECTED]                        [EMAIL PROTECTED]
+41 44 272 91 61
