Only if the website provides "Last-Modified:" and "ETag:" response headers on
the initial retrieval, and only if it understands Nutch's "If-Modified-Since"
request header. However, even in that case Nutch must stay polite and avoid
frequent requests against the same site, even with "If-Modified-Since"
request headers: each HTTP request is logged, the server may well send a
fresh full response instead of a 304, and every request (even one answered
with 304) still consumes server-side resources: a TCP/IP connection, a
client thread, CPU, and so on.
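To make the point concrete, here is a minimal server-side sketch of the conditional-GET handshake described above. All names, the timestamp, and the page body are illustrative, not taken from Nutch or any real server; the only real machinery is the RFC-style HTTP-date handling from Python's standard library:

```python
from email.utils import formatdate, parsedate_to_datetime

# Illustrative values: the page's last-modification time and its body.
PAGE_MTIME = 1_251_300_000          # Unix time the page last changed
PAGE_BODY = "<html>example</html>"

def handle_get(if_modified_since=None):
    """Return (status, headers, body), honouring If-Modified-Since.

    Note that even the 304 path runs on the server: the request was
    still received, logged, and handled by a thread.
    """
    last_modified = formatdate(PAGE_MTIME, usegmt=True)
    if if_modified_since is not None:
        client_ts = parsedate_to_datetime(if_modified_since).timestamp()
        if PAGE_MTIME <= client_ts:
            # Client's copy is still fresh: 304 Not Modified, no body.
            return 304, {"Last-Modified": last_modified}, None
    # Page is newer than the client's copy (or no validator sent): full 200.
    return 200, {"Last-Modified": last_modified}, PAGE_BODY

# Initial fetch: full 200 response carrying a Last-Modified validator.
status, headers, body = handle_get()
# Revalidation: crawler echoes it back as If-Modified-Since and gets a 304.
status2, _, body2 = handle_get(if_modified_since=headers["Last-Modified"])
```

Even in the 304 case the server does real work per request, which is why the crawl delay still matters.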


You can have several threads concurrently accessing the same website with a
crawl delay of 0 only if that website is fully under your control and you
have permission to do so.


-Fuad
http://www.linkedin.com/in/liferay
http://www.tokenizer.org



-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Paul
Tomblin
Sent: August-26-09 1:36 PM
To: [email protected]
Subject: Re: Is Nutch purposely slowing down the crawl, or is it just really
really inefficient?

On Wed, Aug 26, 2009 at 1:32 PM, MilleBii<[email protected]> wrote:
> beware you could create a kind of Denial of Service attack if you search
the
> site too quickly.
>

Well, since I fixed Nutch so that it understands "Last-Modified" and
"If-Modified-Since", it won't be downloading the pages 99% of the
time, just getting a "304" response.

-- 
http://www.linkedin.com/in/paultomblin
