Only if the website provides "Last-Modified:" or "ETag:" response headers on the initial retrieval, and only if it understands Nutch's "If-Modified-Since" request header... However, even in this case Nutch must be polite and avoid frequent requests against the same site, even with "If-Modified-Since" request headers: each HTTP request is logged, the server may send a full fresh response instead of a 304, and every request (even one answered with 304) still consumes server-side resources (TCP/IP connection setup, a client thread, CPU, etc.).
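To make the mechanism concrete, here is a minimal sketch (not Nutch's actual implementation) of the client side of a conditional GET: formatting a stored fetch timestamp as an RFC 1123 HTTP date for the "If-Modified-Since" header, and deciding from the status code whether the body must be downloaded again. The class and method names are illustrative, not from the Nutch codebase.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

public class ConditionalFetch {
    // HTTP dates use the RFC 1123 fixed-length format, always in GMT
    // (e.g. "Thu, 01 Jan 1970 00:00:00 GMT").
    static final DateTimeFormatter HTTP_DATE =
        DateTimeFormatter.ofPattern("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.US);

    // Build the If-Modified-Since value from the last successful fetch time.
    static String ifModifiedSince(long lastFetchEpochSeconds) {
        return HTTP_DATE.format(
            Instant.ofEpochSecond(lastFetchEpochSeconds).atZone(ZoneOffset.UTC));
    }

    // 304 Not Modified means the cached copy is still current;
    // anything else (typically 200) requires downloading the body.
    static boolean needsDownload(int statusCode) {
        return statusCode != 304;
    }

    public static void main(String[] args) {
        System.out.println(ifModifiedSince(0));
        System.out.println(needsDownload(304));
        System.out.println(needsDownload(200));
    }
}
```

Note that even when needsDownload() returns false, the request itself already cost the server a logged connection, so the crawl-delay politeness rules still apply.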
You can have several threads concurrently accessing the same website with a crawl delay of 0 only if that website is fully under your own control and you have permission.

-Fuad
http://www.linkedin.com/in/liferay
http://www.tokenizer.org

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Paul Tomblin
Sent: August-26-09 1:36 PM
To: [email protected]
Subject: Re: Is Nutch purposely slowing down the crawl, or is it just really really inefficient?

On Wed, Aug 26, 2009 at 1:32 PM, MilleBii<[email protected]> wrote:
> beware you could create a kind of Denial of Service attack if you search the
> site too quickly.

Well, since I fixed Nutch so that it understands "Last-Modified" and "If-Modified-Since", it won't be downloading the pages 99% of the time, just getting a "304" response.

--
http://www.linkedin.com/in/paultomblin
