Ken Krugler wrote:
* mapred.tasktracker.tasks.maximum = 2
* fetcher.threads.fetch = 10
You should increase those values, too. mapred.tasktracker.tasks.maximum
should be higher than the number of CPU cores in your boxes, so that the
tasktracker can keep all CPUs busy. Maybe try a value of 8.
I would try fetcher.threads.fetch = 100, so that threads fetching dead
URLs and waiting for the timeout don't slow down the fetching.
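In the config files, those two settings would look roughly like this
(property names taken from the values quoted above; which file they live
in depends on your setup, e.g. hadoop-site.xml vs. nutch-site.xml):

```xml
<!-- allow up to 8 concurrent tasks per tasktracker,
     matching the 8 CPU cores -->
<property>
  <name>mapred.tasktracker.tasks.maximum</name>
  <value>8</value>
</property>

<!-- more fetcher threads, so threads stuck on dead URLs
     don't dominate the fetch -->
<property>
  <name>fetcher.threads.fetch</name>
  <value>100</value>
</property>
```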
* mapred.map.tasks = 1000
* mapred.reduce.tasks = 39
* mapred.child.heap.size = 500m
The number of map.tasks should be a multiple of the number of available
CPUs; I would set it to 12 or 16 for your setup.
The number of reduce tasks should be about the number of CPUs, 8 in your
case. Maybe you should even try a higher number for fetching, since the
number of mapper tasks used (the actual fetching) is the same as the
number of reducer tasks.
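As a sketch, with 8 CPU cores the task counts above would be set like
this (the exact multiple for map.tasks is a judgment call):

```xml
<!-- a small multiple of the 8 available CPU cores -->
<property>
  <name>mapred.map.tasks</name>
  <value>16</value>
</property>

<!-- about one reducer per CPU core -->
<property>
  <name>mapred.reduce.tasks</name>
  <value>8</value>
</property>
```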
The tasktracker starts a new JVM for every task with child.heap.size as
an argument, so mapred.tasktracker.tasks.maximum * mapred.child.heap.size
is the maximum amount of free RAM needed after starting the tasktracker,
which is set to 1000m in the default configuration.
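The back-of-the-envelope RAM math for the numbers suggested above
(assuming 8 concurrent tasks at 500m heap each) would then be:

```xml
<!-- each child JVM gets this much heap -->
<property>
  <name>mapred.child.heap.size</name>
  <value>500m</value>
</property>
<!-- worst case: 8 concurrent tasks * 500m = ~4000m of free RAM
     needed per box, on top of the tasktracker process itself -->
```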
To get good crawling performance you should inject a lot of different
domains into your webdb first, because the fetcher has very polite
settings in its default configuration. It will use only one thread per
domain and won't fetch another URL from that domain for 5 seconds
afterwards. When creating the segment, all URLs will be put into one
fetchlist for one task. This means that with these settings you cannot
fetch more than 0.2 pages/s from one domain, but I guess your boxes
should easily be able to fetch 100+ pages/s per task, depending on your
available bandwidth.
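The politeness defaults behind that 0.2 pages/s limit sit in the fetcher
configuration; assuming the usual Nutch property names (these are not
quoted above, so double-check them against your nutch-default.xml), they
look roughly like this:

```xml
<!-- only one fetcher thread per host -->
<property>
  <name>fetcher.threads.per.host</name>
  <value>1</value>
</property>
<!-- wait 5 seconds between requests to the same host:
     1 thread / 5 s = 0.2 pages/s per domain -->
<property>
  <name>fetcher.server.delay</name>
  <value>5.0</value>
</property>
```

Relaxing these is possible but impolite; injecting many domains so the
fetchlists stay diverse is the better fix.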
regards,
Dominik
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general