I'm experiencing a problem with max threads per host right now.

Nutch is completely ignoring 'maximum threads per host' and the delay after
a thread finishes with a host.

I have the version from 6/24.
The problem is there regardless of whether I go with the default settings
(put nothing in nutch-site.xml regarding the fetcher) or specify fetcher
threads=20.

To reproduce:
Fetch something in several segments. 
Merge several segments.
Replace in the configuration of regex-urlfilter.txt:
[EMAIL PROTECTED]
with 
[EMAIL PROTECTED]
because I want to crawl all the forums in my target sites.

Delete the database and recreate it (updatedb).
Start fetching again.

At this point I can see 20 URLs on the same host being fetched at once, along
with a bunch of errors, because the target sites cannot serve me 20 pages per
10 seconds.

Is this because I'm excluding the default "?=" or... ? Any idea how to fetch
maximum 1 page per host per fetching run?

I partially solved the problem by splitting the fetching workload into 20
segments and fetching with 3-5 threads per segment, but this isn't a nice
solution, as I have to micro-manage all the fetch segments and merge them
afterward.

E.

-----Original Message-----
From: Doug Cutting [mailto:[EMAIL PROTECTED] 
Sent: Thursday, July 07, 2005 2:25 PM
To: [email protected]
Subject: Re: Problems with Fetcher threads?

Are you just crawling a single site?  Just one?  What is 
fetcher.threads.per.host?   It is one by default, but only if 
fetcher.threads.per.host is greater than one will the fetcher be able to 
effectively use multiple threads to crawl a single site.  Otherwise 
these threads will conflict and fail to fetch pages.
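
A minimal sketch of how the fetcher settings discussed in this thread might
be overridden in nutch-site.xml. It assumes the property names
fetcher.threads.fetch, fetcher.threads.per.host and fetcher.server.delay as
listed in nutch-default.xml, and a nutch-conf root element, which may differ
between Nutch versions. The values are illustrative only.

```xml
<?xml version="1.0"?>
<!-- Assumed root element; check the nutch-default.xml shipped with your build. -->
<nutch-conf>

  <!-- Total number of fetcher threads (the "threads=20" case above). -->
  <property>
    <name>fetcher.threads.fetch</name>
    <value>20</value>
  </property>

  <!-- Maximum number of threads allowed to access one host at a time. -->
  <property>
    <name>fetcher.threads.per.host</name>
    <value>1</value>
  </property>

  <!-- Seconds to wait between successive requests to the same server
       (illustrative value). -->
  <property>
    <name>fetcher.server.delay</name>
    <value>5.0</value>
  </property>

</nutch-conf>
```

With one thread per host, a crawl of a single site is effectively serial no
matter how many fetcher threads are configured.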

Doug

Jakob Heidebrecht wrote:
> Hello,
> 
> Is there a problem of fetching with many threads?
> 
> I injected a single URL into the DB and ran three fetch cycles in each case.
> 
> First case 1 fetcher thread, second and third 20 fetcher threads.
> 
> In the first case I got 102 pages,
> in the second 19 pages and
> in the third 22 pages.
> 
> Everything else was the same all the time.
> 
> Is this a bug?
> Could the server be kicking me out when I fetch from it with many threads
> at the same time?
> 
> Regards,
> 
> Jakob
> 




