Dominik Friedrich wrote:
To get good crawling performance you should first inject a lot of different domains into your webdb, because the fetcher has very polite settings in its default configuration. It will use only one thread per domain and won't fetch another URL from that domain for 5 seconds afterwards. When creating the segment, all URLs will be put into one fetchlist for one task. This means that with these settings you cannot fetch more than 0.2 pages/s from one domain, but I guess your boxes should be able to easily fetch 100+ pages/s per task, depending on your available bandwidth.
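
The arithmetic behind the figures quoted above can be sketched as follows. This is only an illustration: the class and method names are made up for this example, and the bound ignores the time the fetch itself takes, so real per-domain rates will be somewhat lower.

```java
/** Back-of-the-envelope politeness arithmetic; hypothetical helper, not part of Nutch. */
public class PolitenessMath {

    /** Upper bound on pages/s from one domain with one thread and a fixed delay between fetches. */
    public static double maxPagesPerSecondPerDomain(double delaySeconds) {
        return 1.0 / delaySeconds;
    }

    /** Minimum number of distinct domains needed to sustain a target overall rate. */
    public static long domainsNeeded(double targetPagesPerSecond, double delaySeconds) {
        return (long) Math.ceil(targetPagesPerSecond * delaySeconds);
    }

    public static void main(String[] args) {
        System.out.println(maxPagesPerSecondPerDomain(5.0)); // 0.2
        System.out.println(domainsNeeded(100.0, 5.0));       // 500
    }
}
```

So with a 5 second delay, a fetchlist needs on the order of 500 distinct domains before a rate of 100 pages/s is even possible.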

Also, I think the current implementation is not optimal, because it runs only a single map task for the fetcher. The reason for this is that it was the easiest way to ensure that we don't violate the politeness rules: if we ran multiple map tasks, the methods blockAddr/unblockAddr in protocol-http couldn't prevent other map tasks from using the same address.
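
To illustrate the problem, here is a minimal sketch of address blocking that lives inside a single process. It is a hypothetical stand-in, not the actual protocol-http code; the class name, the tryBlock/unblock methods and the example address are all assumptions made for this example.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical per-process address blocker; a stand-in, not the protocol-http code. */
public class LocalAddrBlocker {
    private final Set<String> blocked = new HashSet<>();

    /** Returns true if the caller now holds the address, false if it is already in use. */
    public synchronized boolean tryBlock(String addr) {
        return blocked.add(addr);
    }

    /** Frees the address so that another thread in this process can use it. */
    public synchronized void unblock(String addr) {
        blocked.remove(addr);
    }

    public static void main(String[] args) {
        // Two separate instances model two map tasks running in separate processes.
        LocalAddrBlocker taskA = new LocalAddrBlocker();
        LocalAddrBlocker taskB = new LocalAddrBlocker();
        System.out.println(taskA.tryBlock("192.0.2.1")); // true
        System.out.println(taskA.tryBlock("192.0.2.1")); // false: task A sees its own block
        System.out.println(taskB.tryBlock("192.0.2.1")); // true: task B cannot see task A's block
    }
}
```

The last call succeeds because the blocking state is local to each instance, which is how two map tasks could end up hitting the same address at once.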

The proper solution is IMHO a central lock manager. I looked at the code, and it seems to me that JobTracker could manage this central lock manager (one per job? one per cluster? perhaps both?). This could be a part of JobSubmissionProtocol, but I think there is currently no way for arbitrary code to reference its JobClient... bummer.
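
A rough sketch of what such a lock manager might look like, assuming a single shared instance that all map tasks can reach somehow (in practice over RPC, which is left out here). Everything in it is hypothetical: HostLockManager, tryAcquire and release are names invented for this example and do not exist in the current code, and the clock is passed in as a parameter only to keep the example deterministic.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical central politeness lock manager; these names do not exist in the current code. */
public class HostLockManager {
    private final long delayMillis;
    private final Map<String, String> owner = new HashMap<>();     // host -> task currently holding it
    private final Map<String, Long> nextAllowed = new HashMap<>(); // host -> earliest next fetch time

    public HostLockManager(long delayMillis) {
        this.delayMillis = delayMillis;
    }

    /** Grants the host to taskId if no task holds it and the politeness delay has elapsed. */
    public synchronized boolean tryAcquire(String host, String taskId, long nowMillis) {
        if (owner.containsKey(host)) return false;
        Long earliest = nextAllowed.get(host);
        if (earliest != null && nowMillis < earliest) return false;
        owner.put(host, taskId);
        return true;
    }

    /** Releases the host and starts the delay; ignored unless taskId is the current holder. */
    public synchronized void release(String host, String taskId, long nowMillis) {
        if (taskId.equals(owner.get(host))) {
            owner.remove(host);
            nextAllowed.put(host, nowMillis + delayMillis);
        }
    }

    public static void main(String[] args) {
        HostLockManager locks = new HostLockManager(5000);
        System.out.println(locks.tryAcquire("example.org", "task-1", 0));    // true
        System.out.println(locks.tryAcquire("example.org", "task-2", 100));  // false: held by task-1
        locks.release("example.org", "task-1", 1000);
        System.out.println(locks.tryAcquire("example.org", "task-2", 3000)); // false: delay runs until 6000
        System.out.println(locks.tryAcquire("example.org", "task-2", 6000)); // true
    }
}
```

Whether one such instance should exist per job or per cluster is exactly the open question; the sketch works the same way in either case.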

Some food for thought, anyway.

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
