I'm toying with the idea of implementing the fetcher as a series of event queues (à la SEDA) instead of with threads. The idea is to break the fetching operation into a series of stages connected by queues, rather than running one fetcher thread per task.
The stages I see are:

1. CrawlStarter (URL injection)
2. URL filtering and normalizing
3. HttpRequest
4. HttpResponse
5. DB of fetched MD5 hashes
6. DB of fetched URLs
7. Parse and link extraction
8. Output
9. Link/Page Scoring

Each of these stages would run in its own thread (except HTML parsing and scoring, which may actually benefit from having multiple threads). With the introduction of non-blocking IO, I think threads should be used only where parallel computation offers a real performance advantage. Splitting HttpRequest and HttpResponse into separate stages also paves the way for a non-blocking HTTP implementation.

Another big advantage is a decrease in programmatic complexity (and possibly a performance gain). With most of the stages guaranteed to be single-threaded, threading/synchronization issues are dramatically reduced. This may not be so evident in the current/map-reduce fetch code, but because of the completely online nature of nutch-84/OC, it does simplify things considerably.

I'll need to dig a bit more to see how this can be conceptually translated into map-reduce, but I imagine it's doable. Perhaps each stage gets mapped, then reduced? Any thoughts?
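To make the stage/queue idea concrete, here is a minimal sketch (hypothetical class and variable names, not actual Nutch code) of two of the stages above connected by a bounded java.util.concurrent.BlockingQueue, with the filtering stage running single-threaded in its own thread:

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// SEDA-style sketch: stage 1 (CrawlStarter) injects URLs into a queue,
// stage 2 (URL filtering/normalizing) consumes them in its own thread and
// feeds a second queue, which a downstream stage (stand-in for HttpRequest)
// drains. All names here are illustrative assumptions.
public class SedaSketch {
    // Sentinel value used to signal end-of-input through the pipeline.
    static final String POISON = "__DONE__";

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> rawUrls = new LinkedBlockingQueue<>(100);
        BlockingQueue<String> filteredUrls = new LinkedBlockingQueue<>(100);

        // Stage 2: URL filtering and normalizing, guaranteed single-threaded,
        // so it needs no synchronization beyond the queues themselves.
        Thread filterStage = new Thread(() -> {
            try {
                while (true) {
                    String url = rawUrls.take();
                    if (url.equals(POISON)) {
                        filteredUrls.put(POISON); // propagate shutdown downstream
                        return;
                    }
                    String normalized = url.trim().toLowerCase();
                    if (normalized.startsWith("http://")) {
                        filteredUrls.put(normalized);
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        filterStage.start();

        // Stage 1: CrawlStarter injects a few URLs, then signals completion.
        for (String u : List.of("HTTP://Example.COM/a ", "ftp://skip.me",
                                "http://example.com/b")) {
            rawUrls.put(u);
        }
        rawUrls.put(POISON);

        // Downstream stage (stand-in for HttpRequest) drains the queue.
        while (true) {
            String url = filteredUrls.take();
            if (url.equals(POISON)) break;
            System.out.println("fetch: " + url);
        }
        filterStage.join();
    }
}
```

Because each stage only talks to its neighbors through the queues, backpressure falls out for free (the bounded queues block a fast producer), and a stage can later be given multiple worker threads (e.g. parsing, scoring) without touching the others.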
