There is a problem with max threads per host that I'm experiencing right now. Nutch is completely ignoring 'maximum threads per host' and the delay after one thread finishes with a host.
I have the version from 6/24. The problem occurs regardless of whether I go with the default settings (nothing in nutch-site.xml regarding the fetcher) or specify fetcher threads=20.

To reproduce:
1. Fetch something in several segments.
2. Merge several segments.
3. In regex-urlfilter.txt, replace [EMAIL PROTECTED] with [EMAIL PROTECTED], because I want to crawl all the forums in my target sites.
4. Delete the database and recreate it (updatedb).
5. Start fetching again.

At this point I can see 20 URLs on the same host being fetched, and a bunch of errors because the target sites cannot serve me 20 pages per 10 seconds.

Is this because I'm excluding the default "?=" or... ? Any idea how to fetch a maximum of 1 page per host per fetching run?

I partially solved the problem by splitting the fetching workload into 20 segments and fetching with 3-5 threads per segment, but this isn't a nice solution, as I have to micro-manage all the fetch segments and merge them afterward.

E.

-----Original Message-----
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: Thursday, July 07, 2005 2:25 PM
To: [email protected]
Subject: Re: Problems with Fetcher threads?

Are you just crawling a single site? Just one? What is fetcher.threads.per.host? It is one by default, but only if fetcher.threads.per.host is greater than one will the fetcher be able to effectively use multiple threads to crawl a single site. Otherwise these threads will conflict and fail to fetch pages.

Doug

Jakob Heidebrecht wrote:
> Hello,
>
> Is there a problem with fetching with many threads?
>
> I injected a single URL into the DB and fetched three cycles in each case.
>
> First case 1 fetcher thread, second and third 20 fetcher threads.
>
> In the first case I got 102 pages,
> in the second 19 pages and
> in the third 22 pages.
>
> Everything else was the same all the time.
>
> Is this a bug?
> May the server kick me out when I'm fetching from it with many threads at the
> same time?
>
> Regards,
>
> Jakob

_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
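
For reference, the fetcher settings discussed in this thread would be overridden in nutch-site.xml. The fragment below is only a sketch of what such an override might look like. Of the three properties, only fetcher.threads.per.host is named in the thread; fetcher.threads.fetch and fetcher.server.delay are assumed to be the names used in nutch-default.xml for the total thread count and the per-host delay, the <nutch-conf> root element is assumed for a Nutch version of this age, and the values are illustrative rather than recommended.

```xml
<?xml version="1.0"?>
<!-- nutch-site.xml: illustrative overrides only.
     Assumptions: the root element name, and every property name
     except fetcher.threads.per.host, are not confirmed by the thread. -->
<nutch-conf>

  <!-- Total number of fetcher threads (the "fetcher threads=20" case above). -->
  <property>
    <name>fetcher.threads.fetch</name>
    <value>20</value>
  </property>

  <!-- Maximum threads allowed to access one host at the same time.
       Doug notes that the default is 1. -->
  <property>
    <name>fetcher.threads.per.host</name>
    <value>1</value>
  </property>

  <!-- Seconds to wait between successive requests to the same host.
       The value here is a placeholder. -->
  <property>
    <name>fetcher.server.delay</name>
    <value>5.0</value>
  </property>

</nutch-conf>
```

With settings along these lines, the expectation would be at most one active request per host with a pause between requests, which is the behavior reported as being ignored above. Checking the names against the nutch-default.xml shipped with the installed build is the safest way to confirm them.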
