Hi all,
I just committed a new implementation of the venerable Fetcher, called
Fetcher2. It uses a producer/consumers model with a set of per-host
queues. In theory it should achieve much higher
throughput, especially for fetchlists with a lot of contention (many
URLs from the same hosts).
It should be possible to achieve the same fetching rate with a smaller
number of threads, and most importantly to avoid the dreaded "Exceeded
http.max.delays: retry later" error.
It is available through "bin/nutch fetch2".
From the javadoc:
"A queue-based fetcher.
This fetcher uses a well-known model of one producer (a QueueFeeder) and
many consumers (FetcherThread-s).
QueueFeeder reads input fetchlists and populates a set of
FetchItemQueue-s, which hold FetchItem-s that describe the items to be
fetched. There are as many queues as there are unique hosts, but at any
given time the total number of fetch items in all queues is less than a
fixed number (currently set to a multiple of the number of threads).
As items are consumed from the queues, the QueueFeeder continues to add
new input items, so that their total count stays fixed (FetcherThread-s
may also add new items to the queues, e.g. as a result of redirection) -
until all input items are exhausted, at which point the number of items
in the queues begins to decrease. When this number reaches 0, the fetcher
will finish.
This fetcher implementation handles per-host blocking itself, instead of
delegating this work to protocol-specific plugins. Each per-host queue
handles its own "politeness" settings, such as the maximum number of
concurrent requests and the crawl delay between consecutive requests. It
also tracks the requests in progress and the time the last request
finished. As FetcherThread-s ask for new items to be fetched, queues may
return eligible items or null if for "politeness" reasons this host's
queue is not yet ready.
If there are still unfetched items on the queues, but none of the items
are ready, FetcherThread-s will spin-wait until either some items become
available, or a timeout is reached (at which point the Fetcher will
abort, assuming the task is hung)."
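To make the per-host "politeness" check described in the javadoc a bit more
concrete, here is a minimal sketch of how one such queue might decide whether
to hand out an item. This is not the actual Fetcher2 code: the class name
HostQueueSketch, its fields, and its method names are all made up for
illustration, and time is passed in explicitly to keep the example
deterministic.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical per-host queue sketch; names are assumptions, not Fetcher2's API. */
class HostQueueSketch {
    private final Deque<String> urls = new ArrayDeque<>();
    private final int maxInProgress;     // max concurrent requests to this host
    private final long crawlDelayMs;     // minimum gap after the last request finished
    private int inProgress = 0;          // requests currently being fetched
    private long earliestNextFetch = 0L; // the queue is ready from time 0

    HostQueueSketch(int maxInProgress, long crawlDelayMs) {
        this.maxInProgress = maxInProgress;
        this.crawlDelayMs = crawlDelayMs;
    }

    /** Called by the producer (or by a consumer on redirect) to enqueue a URL. */
    synchronized void add(String url) {
        urls.addLast(url);
    }

    /** Returns the next URL, or null if the queue is empty or not yet "polite" to use. */
    synchronized String poll(long nowMs) {
        if (inProgress >= maxInProgress) return null; // too many concurrent requests
        if (nowMs < earliestNextFetch) return null;   // crawl delay has not elapsed
        String url = urls.pollFirst();
        if (url != null) inProgress++;
        return url;
    }

    /** Called by a consumer when its request completes; starts the crawl delay. */
    synchronized void finish(long nowMs) {
        if (inProgress > 0) inProgress--;
        earliestNextFetch = nowMs + crawlDelayMs;
    }

    synchronized int size() {
        return urls.size();
    }

    public static void main(String[] args) {
        HostQueueSketch q = new HostQueueSketch(1, 5000L);
        q.add("http://example.com/a");
        q.add("http://example.com/b");
        System.out.println(q.poll(0L));    // http://example.com/a
        System.out.println(q.poll(1L));    // null (one request already in progress)
        q.finish(1000L);
        System.out.println(q.poll(3000L)); // null (crawl delay not yet elapsed)
        System.out.println(q.poll(6000L)); // http://example.com/b
    }
}
```

A consumer thread that gets null from every queue would simply wait and try
again, which corresponds to the spin-wait described in the javadoc.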
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com