[ http://issues.apache.org/jira/browse/NUTCH-339?page=all ]

Andrzej Bialecki  updated NUTCH-339:
------------------------------------

    Attachment: patch4-trunk.txt

These patches implement a queue-based Fetcher, where fetching threads don't 
spin-wait for blocking entries.

A few comments on the architecture of Fetcher2:

* per-host blocking is disabled in lib-http when plugins are used with Fetcher2, 
because in that case Fetcher2 handles the blocking itself - otherwise the plugin 
works as before (the net effect is that you can still use the plain Fetcher with 
these patches).

* RobotRules can be obtained now from any protocol, and a default dummy 
implementation is provided for those protocols that normally don't define them.

* fetchlist records are read by a separate thread (QueueFeeder), and stuffed 
into a set of queues, based on a combination of protocol + host name (or host 
address, depending on a setting); i.e. for a URL 
"http://www.cnn.com/SPORT/index.html" the queueID will be either 
"http://www.cnn.com" or "http://64.236.24.28". QueueFeeder maintains a fixed 
total size of the queues (N * number of fetcher threads), until it exhausts all 
input records.

* each proto/host queue keeps its own information about:

   - max number of threads (maxThreads) for this proto/host combination
   - crawlDelay (when maxThreads == 1) and minCrawlDelay (when maxThreads > 1)
   - a set of items currently being processed (inProgress)
   - time when the last fetch request was finished (endTime)

Items are picked from the queue in FIFO order, provided that inProgress.size() < 
maxThreads and endTime + crawlDelay < now. Picked items are recorded in the 
inProgress set.
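
To make the picking rule above concrete, here is a minimal sketch of a single proto/host queue. It is not the code from the patch: the class name HostQueueSketch and the methods add/poll/finish are made up for illustration, the delays are assumed to be in milliseconds, and "now" is passed in explicitly so the behaviour is easy to follow. Only maxThreads, crawlDelay, minCrawlDelay, inProgress and endTime come from the description above.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of one proto/host queue; not the Fetcher2 implementation.
class HostQueueSketch {
    private final Deque<String> queue = new ArrayDeque<>();   // pending URLs, FIFO
    private final Set<String> inProgress = new HashSet<>();   // URLs being fetched
    private final int maxThreads;
    private final long crawlDelay;      // used when maxThreads == 1 (assumed ms)
    private final long minCrawlDelay;   // used when maxThreads > 1 (assumed ms)
    private long endTime = 0L;          // when the last fetch request finished

    HostQueueSketch(int maxThreads, long crawlDelay, long minCrawlDelay) {
        this.maxThreads = maxThreads;
        this.crawlDelay = crawlDelay;
        this.minCrawlDelay = minCrawlDelay;
    }

    synchronized void add(String url) {
        queue.addLast(url);
    }

    // Returns the next URL, or null if this queue is not eligible right now.
    synchronized String poll(long now) {
        long delay = (maxThreads > 1) ? minCrawlDelay : crawlDelay;
        if (inProgress.size() >= maxThreads) return null;  // too many active fetches
        if (endTime + delay >= now) return null;            // politeness delay not over
        String url = queue.pollFirst();
        if (url != null) inProgress.add(url);               // record the picked item
        return url;
    }

    // Called by the fetching thread when it is done with a URL.
    synchronized void finish(String url, long now) {
        if (inProgress.remove(url)) endTime = now;
    }
}
```

The point of returning null instead of sleeping is that the calling thread is free to try another queue, rather than blocking on this host.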

* there is one global set of queues in the fetcher, with some utility methods 
to keep track of the total number of queued items, and to get the first 
eligible item from any queue.

* FetcherThread-s try to pick up new work items from the queues, or spin-wait 
if none are available yet.

* when both the input and the queues are exhausted, the fetcher finishes its map 
operation.
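
As a rough illustration of the queue ID and the global set of queues described above, something along these lines would do. Again this is only a sketch under assumptions, not the patch: FetchQueuesSketch, queueId, add, next and totalSize are hypothetical names, only the host-name variant of the ID is shown (the host-address variant would need a DNS lookup), and "eligible" is simplified here to "first non-empty queue" with no politeness checks.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Queue;

// Hypothetical sketch of the global set of queues; not the Fetcher2 implementation.
class FetchQueuesSketch {
    // One FIFO queue per protocol + host, kept in insertion order.
    private final Map<String, Queue<String>> queues = new LinkedHashMap<>();
    private int totalSize = 0;

    // Builds the queue ID from protocol + host name,
    // e.g. "http://www.cnn.com/SPORT/index.html" gives "http://www.cnn.com".
    static String queueId(String url) {
        URI u = URI.create(url);
        return u.getScheme() + "://" + u.getHost();
    }

    // What a feeder thread would call for each fetchlist record.
    synchronized void add(String url) {
        queues.computeIfAbsent(queueId(url), k -> new ArrayDeque<>()).add(url);
        totalSize++;
    }

    // Total number of queued items across all queues.
    synchronized int totalSize() {
        return totalSize;
    }

    // Returns the first available item from any queue, or null if all are empty.
    // Simplification: a real fetcher would also check per-queue eligibility.
    synchronized String next() {
        for (Queue<String> q : queues.values()) {
            String url = q.poll();
            if (url != null) {
                totalSize--;
                return url;
            }
        }
        return null;
    }
}
```

A feeder would keep calling add() while totalSize() stays below its target (N * number of fetcher threads), and fetching threads would call next() and retry later when it returns null.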

In my limited experiments I didn't notice the previous effects of thread 
starvation, because threads no longer block when they can't process the current 
item. However, there are still issues with very slow sites (most probably we 
need to terminate such threads), and when a slow site contributes many pages 
from the same host, fetch items still tend to accumulate - so towards the end 
of the fetch the speed may still be slightly lower.

The advantage of this new architecture is that it's much easier to understand 
how blocking occurs, and also that reading from input is decoupled from further 
processing, which should make it easier to move later on to NIO-based 
(non-blocking) processing.

Some open issues:

* it was quite difficult to measure the fetching speed consistently. Due to 
changing network conditions, results vary even for the same fetchlist and the 
same implementation - and the differences can be significant (like 15 pages/s 
for one run vs. 3 pages/s for another run with exactly the same parameters).

* I decided for now not to use NIO. The reason is that protocol plugins don't 
support it, so if we switched to select-based modus operandi we would have to 
rewrite all protocol plugins.

Please give it a try - comments, suggestions and patches are welcome!

> Refactor nutch to allow fetcher improvements
> --------------------------------------------
>
>                 Key: NUTCH-339
>                 URL: http://issues.apache.org/jira/browse/NUTCH-339
>             Project: Nutch
>          Issue Type: Task
>          Components: fetcher
>    Affects Versions: 0.8
>         Environment: n/a
>            Reporter: Sami Siren
>         Assigned To: Sami Siren
>             Fix For: 0.9.0
>
>         Attachments: patch.txt, patch2.txt, patch3.txt, patch4-trunk.txt
>
>
> As I (and Stefan?) see it there are two major areas where the current fetcher
> could be improved (as in speed)
> 1. Politeness code and how it is implemented is the biggest
> problem of current fetcher(together with robots.txt handling).
> With a simple code changes like replacing it with a PriorityQueue
> based solution showed very promising results in increased IO.
> 2. Changing fetcher to use non blocking io (this requires great amount
> of work as we need to implement the protocols from scratch again).
> I would like to start with working towards #1 by first refactoring
> the current code (plugins actually) in following way:
> 1. Move robots.txt handling away from (lib-http)plugin.
> Even if this is related only to http, leaving it to lib-http
> does not allow other kinds of scheduling strategies to be implemented
> (it is hardcoded to fetch robots.txt from the same thread when requesting
> a page from a site from which it hasn't tried to load robots.txt)
> 2. Move code for politeness away from (lib-http)plugin
> It is really usable outside http and also the current design limits
> changing of the implementation (to queue based)
> Where to move these, well my suggestion is the nutch core, does anybody
> see problems with this?
> These code refactoring activities are to be done in a way that none
> of the current functionality is (at least deliberately) changed leaving
> current functionality as is thus leaving room and possibility to build
> the next generation fetcher(s) without destroying the old one at same time.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers
