Fetcher2 should be a great help to me, but it doesn't seem to integrate with Nutch 0.8.1. Any advice on how to use it with 0.8.1?

----- Original Message -----
From: "Andrzej Bialecki" <[EMAIL PROTECTED]>
To: <nutch-dev@lucene.apache.org>
Sent: Thursday, January 18, 2007 5:18 AM
Subject: Fetcher2

> Hi all,
>
> I just committed a new implementation of the venerable fetcher, called
> Fetcher2. It uses a producer/consumers model with a set of per-host
> queues. Theoretically it should be able to achieve a much higher
> throughput, especially for fetchlists with a lot of contention (many
> URLs from the same hosts).
>
> It should be possible to achieve the same fetching rate with a smaller
> number of threads, and most importantly to avoid the dreaded "Exceeded
> http.max.delays: retry later" error.
>
> It is available through "bin/nutch fetch2".
>
> From the javadoc:
>
> "A queue-based fetcher.
>
> This fetcher uses a well-known model of one producer (a QueueFeeder) and
> many consumers (FetcherThread-s).
>
> QueueFeeder reads input fetchlists and populates a set of
> FetchItemQueue-s, which hold FetchItem-s that describe the items to be
> fetched. There are as many queues as there are unique hosts, but at any
> given time the total number of fetch items in all queues is less than a
> fixed number (currently set to a multiple of the number of threads).
>
> As items are consumed from the queues, the QueueFeeder continues to add
> new input items, so that their total count stays fixed (FetcherThread-s
> may also add new items to the queues, e.g. as a result of redirection),
> until all input items are exhausted, at which point the number of items
> in the queues begins to decrease. When this number reaches 0 the fetcher
> will finish.
>
> This fetcher implementation handles per-host blocking itself, instead of
> delegating this work to protocol-specific plugins. Each per-host queue
> handles its own "politeness" settings, such as the maximum number of
> concurrent requests and the crawl delay between consecutive requests,
> and also a list of requests in progress and the time the last request
> was finished. As FetcherThread-s ask for new items to be fetched, queues
> may return eligible items, or null if for "politeness" reasons this
> host's queue is not yet ready.
>
> If there are still unfetched items on the queues, but none of the items
> are ready, FetcherThread-s will spin-wait until either some items become
> available, or a timeout is reached (at which point the Fetcher will
> abort, assuming the task is hung)."
>
>
> --
> Best regards,
> Andrzej Bialecki <><
> ___. ___ ___ ___ _ _ __________________________________
> [__ || __|__/|__||\/| Information Retrieval, Semantic Web
> ___|||__|| \| || | Embedded Unix, System Integration
> http://www.sigram.com Contact: info at sigram dot com
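
For anyone trying to picture the per-host queue behaviour described in the javadoc above, here is a minimal sketch in Python. It is not the Nutch code (Fetcher2 itself is Java), and it only borrows the names FetchItemQueue and FetchItemQueues for readability. The method names, the `crawl_delay` and `max_in_progress` parameters, the 5-second delay, and the explicit `now` argument are all assumptions made for illustration. It shows only the "return an eligible item or null" part, not the QueueFeeder or the threads.

```python
from collections import OrderedDict, deque
from urllib.parse import urlparse


class FetchItemQueue:
    """Pending URLs for one host plus that host's politeness state (hypothetical)."""

    def __init__(self, crawl_delay, max_in_progress):
        self.crawl_delay = crawl_delay          # seconds between requests
        self.max_in_progress = max_in_progress  # concurrent requests allowed
        self.pending = deque()
        self.in_progress = set()
        self.next_fetch_time = 0.0              # earliest time of the next request

    def get_item(self, now):
        """Return a URL to fetch, or None if this host is not ready yet."""
        if not self.pending:
            return None
        if len(self.in_progress) >= self.max_in_progress:
            return None
        if now < self.next_fetch_time:
            return None
        url = self.pending.popleft()
        self.in_progress.add(url)
        return url

    def finish_item(self, url, now):
        """Record that a request finished and start the crawl delay."""
        self.in_progress.discard(url)
        self.next_fetch_time = now + self.crawl_delay


class FetchItemQueues:
    """One FetchItemQueue per unique host (hypothetical)."""

    def __init__(self, crawl_delay=5.0, max_in_progress=1):
        self.crawl_delay = crawl_delay
        self.max_in_progress = max_in_progress
        self.queues = OrderedDict()

    def add(self, url):
        host = urlparse(url).netloc
        if host not in self.queues:
            self.queues[host] = FetchItemQueue(self.crawl_delay,
                                               self.max_in_progress)
        self.queues[host].pending.append(url)

    def total_pending(self):
        return sum(len(q.pending) for q in self.queues.values())

    def get_item(self, now):
        """Return (host, url) from the first ready queue, or None."""
        for host, queue in self.queues.items():
            url = queue.get_item(now)
            if url is not None:
                return host, url
        return None

    def finish_item(self, host, url, now):
        self.queues[host].finish_item(url, now)


# Two URLs on one host, one URL on another; times are in seconds.
queues = FetchItemQueues(crawl_delay=5.0)
for u in ["http://a.example/1", "http://a.example/2", "http://b.example/1"]:
    queues.add(u)

first = queues.get_item(now=0.0)     # a.example is free, so its first URL
second = queues.get_item(now=0.0)    # a.example is busy, so b.example is served
third = queues.get_item(now=0.0)     # nothing is eligible right now
queues.finish_item(*first, now=1.0)  # a.example may be hit again at t = 6.0
fourth = queues.get_item(now=2.0)    # still inside the crawl delay
fifth = queues.get_item(now=6.0)     # delay elapsed, second a.example URL
```

A consumer thread that gets None back while `total_pending()` is still above zero would wait and retry, which corresponds to the spin-wait mentioned in the javadoc. Passing `now` in explicitly rather than reading the clock inside the queue is just a choice to keep the sketch deterministic.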