Ken Krugler wrote:
 * mapred.tasktracker.tasks.maximum = 2
 * fetcher.threads.fetch = 10
You should increase those values, too. mapred.tasktracker.tasks.maximum should be higher than the number of CPU cores you have in your boxes, so that the tasktracker can keep all CPUs busy. You could try a value of 8.

I would try fetcher.threads.fetch = 100, so that threads which are fetching dead URLs and waiting for the timeout don't slow down the fetching.

 * mapred.map.tasks = 1000
 * mapred.reduce.tasks = 39
 * mapred.child.heap.size = 500m
The number of map tasks (mapred.map.tasks) should be a multiple of the number of available CPUs. I would set it to 12 or 16 for your setup.

The number of reduce tasks should be about the number of CPUs, 8 in your case. You might even try a higher number for fetching, since the number of map tasks used (the actual fetching) is the same as the number of reduce tasks.

The tasktracker starts a new JVM for every task, with mapred.child.heap.size as an argument. So mapred.tasktracker.tasks.maximum * mapred.child.heap.size is the maximum amount of free RAM needed after starting the tasktracker, which is itself set to 1000m in the default configuration.
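
As a rough check of that rule, the sketch below multiplies the number of task slots by the child heap size. The function name is made up for illustration; 2 and 500m are the quoted settings, and 8 is the value suggested above. Real JVM memory use will be somewhat above the heap limit, so treat the result as a lower estimate of what to keep free.

```python
def child_heap_total_mb(tasks_maximum, child_heap_mb):
    """Heap that all child JVMs on one tasktracker node may claim, in MB.

    tasks_maximum stands for mapred.tasktracker.tasks.maximum and
    child_heap_mb for mapred.child.heap.size (hypothetical helper).
    """
    return tasks_maximum * child_heap_mb


# Quoted settings: 2 task slots with 500m each.
print(child_heap_total_mb(2, 500))

# Suggested setting: 8 task slots with 500m each.
print(child_heap_total_mb(8, 500))
```

With the quoted settings this comes to 1000 MB per node, and with 8 task slots to 4000 MB per node, in addition to the tasktracker's own heap.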

To get good crawling performance you should inject a lot of different domains into your webdb first, because the fetcher has very polite settings in its default configuration. It will use only one thread per domain and won't fetch another URL from that domain for 5 seconds afterwards. When the segment is created, all URLs of a domain are put into one fetchlist for one task. This means that with these settings you cannot fetch more than 0.2 pages/s from one domain (one page every 5 seconds), but I guess your boxes should easily be able to fetch 100+ pages/s per task, depending on your available bandwidth.
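
The arithmetic behind those numbers can be sketched as follows. The function names are hypothetical and the model is simplified: it assumes one thread per domain, a fixed 5 second delay, and ignores download time and bandwidth limits.

```python
import math


def max_pages_per_second_per_domain(delay_seconds, threads_per_domain=1):
    """Upper bound on the fetch rate for a single domain."""
    return threads_per_domain / delay_seconds


def domains_needed(target_pages_per_second, delay_seconds):
    """Distinct domains needed in one fetchlist to reach the target rate,
    assuming one thread per domain."""
    return math.ceil(target_pages_per_second * delay_seconds)


# One thread per domain and a 5 second delay.
print(max_pages_per_second_per_domain(5))

# Domains needed in a fetchlist to reach 100 pages/s.
print(domains_needed(100, 5))
```

Under these assumptions a task needs URLs from roughly 500 different domains in its fetchlist before it can get near 100 pages/s.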

regards,
Dominik



