I'm toying around with the idea of implementing the fetcher as a series of 
event queues (a la SEDA) instead of with threads. The idea is to break the 
fetching operation into a series of stages connected by queues, instead of 
running one fetcher thread per task.

The stages I see are:

1. CrawlStarter (url injection)
2. URL filtering and normalizing
3. HttpRequest
4. HttpResponse
5. DB of fetched MD5 hashes
6. DB of fetched URLs
7. Parse and link extraction
8. Output
9. Link/Page Scoring

Each of these stages will be handled in its own thread (except for HTML parsing 
and scoring, which may actually benefit from having multiple threads). With the 
introduction of non-blocking IO, I think threads should be used only where 
parallel computation offers performance advantages.
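
To make this concrete, here's a minimal sketch of what I mean by stages 
connected by queues: each stage is a single thread draining a bounded 
BlockingQueue, transforming items, and feeding the next stage. The stage 
bodies (a toy URL filter/normalizer and a "fetched" tagger) are stand-ins, 
not the real Nutch stages, and the poison-pill shutdown is just one way to 
do it:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Function;

public class StagePipeline {
    static final String DONE = "__DONE__";   // poison pill for shutdown

    // One single-threaded stage: drain the input queue, transform, enqueue.
    static void stage(BlockingQueue<String> in, BlockingQueue<String> out,
                      Function<String, String> work) {
        Thread t = new Thread(() -> {
            try {
                for (String item = in.take(); !item.equals(DONE); item = in.take()) {
                    String result = work.apply(item);
                    if (result != null) out.put(result);  // null = filtered out
                }
                out.put(DONE);                            // propagate shutdown
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        t.start();
    }

    static List<String> run(List<String> urls) throws InterruptedException {
        BlockingQueue<String> injected = new ArrayBlockingQueue<>(16);
        BlockingQueue<String> filtered = new ArrayBlockingQueue<>(16);
        BlockingQueue<String> output   = new ArrayBlockingQueue<>(16);

        // Stand-in for stage 2 (URL filtering and normalizing).
        stage(injected, filtered,
              u -> u.toLowerCase().startsWith("http") ? u.toLowerCase() : null);
        // Stand-in for the fetch stages: just tag the URL.
        stage(filtered, output, u -> "fetched:" + u);

        for (String u : urls) injected.put(u);
        injected.put(DONE);

        List<String> results = new ArrayList<>();
        for (String r = output.take(); !r.equals(DONE); r = output.take())
            results.add(r);
        return results;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run(List.of("HTTP://EXAMPLE.COM/", "ftp://skip.me",
                                       "http://nutch.org/")));
    }
}
```

The bounded queues give you backpressure for free: if parsing falls behind, 
the fetch stages block instead of piling up work.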

Breaking up HttpRequest and HttpResponse will also pave the way for a 
non-blocking HTTP implementation.
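
The payoff of the split is that the HttpResponse stage can run a single 
Selector loop over many channels, reacting only to those with data ready, 
instead of parking one thread per connection. Here's a self-contained sketch 
of that readiness-based read loop; a Pipe stands in for the network so it 
runs without a server (and is obviously not the real fetcher code):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Pipe;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.charset.StandardCharsets;

public class NonBlockingSketch {
    // Selector-driven read: block on readiness, not on any one connection.
    public static String readWhenReady() throws IOException {
        Selector selector = Selector.open();
        Pipe pipe = Pipe.open();
        pipe.source().configureBlocking(false);
        pipe.source().register(selector, SelectionKey.OP_READ);

        // Pretend the HttpRequest stage sent a request and the server
        // has started responding.
        pipe.sink().write(ByteBuffer.wrap(
                "HTTP/1.0 200 OK".getBytes(StandardCharsets.US_ASCII)));

        StringBuilder body = new StringBuilder();
        if (selector.select(1000) > 0) {          // wait at most 1s for readiness
            for (SelectionKey key : selector.selectedKeys()) {
                ByteBuffer buf = ByteBuffer.allocate(64);
                ((Pipe.SourceChannel) key.channel()).read(buf);
                buf.flip();
                body.append(StandardCharsets.US_ASCII.decode(buf));
            }
            selector.selectedKeys().clear();
        }
        selector.close();
        pipe.sink().close();
        pipe.source().close();
        return body.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readWhenReady());
    }
}
```

With real SocketChannels, the HttpRequest stage would hand the registered 
channel over the queue, and this loop would assemble responses as bytes 
arrive.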

Another big advantage is a decrease in programmatic complexity (and possibly 
a performance gain). With most of the stages guaranteed to be 
single-threaded, threading/synchronization issues are dramatically reduced. 
This may not be so evident in the current/map-red fetch code, but because of 
the completely online nature of nutch-84/OC, it does simplify things 
considerably.

I'll need to dig a bit more to see how this can be conceptually translated 
into map-reduce, but I imagine it's doable. Perhaps each stage gets mapped, 
then reduced?
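
To illustrate the "each stage gets mapped then reduced" guess in miniature: 
a stage's per-URL transformation becomes the map, and the reduce groups by 
URL, which is where something like the "DB of fetched URLs" dedup would 
naturally live. This is purely speculative on my part, not how the nutch-84 
code is structured:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class StageAsMapReduce {
    // One stage as map-then-reduce: normalize per URL (map), then key by
    // URL keeping one record each (reduce), so duplicates collapse.
    public static Map<String, String> mapThenReduce(List<String> urls) {
        return urls.stream()
                   .map(String::toLowerCase)            // map: normalize the URL
                   .collect(Collectors.toMap(u -> u,    // reduce: group by URL,
                                             u -> "fetched",
                                             (a, b) -> a));  // dedup on collision
    }

    public static void main(String[] args) {
        System.out.println(mapThenReduce(
                List.of("http://A.com/", "http://a.com/", "http://b.com/")));
    }
}
```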

Any thoughts?
