http://www.mail-archive.com/[email protected]/msg02394.html
"
Teruhiko Kurosaka wrote:
Can I use MapReduce to run Nutch on a multi CPU system?
Yes.
I want to run the index job on two (or four) CPUs
on a single system. I'm not trying to distribute the job
over multiple systems.
If the MapReduce is the way to go,
do I just specify config parameters like these:
mapred.tasktracker.tasks.maximum=2
mapred.job.tracker=localhost:9001
mapred.reduce.tasks=2 (or 1?)
and
bin/start-all.sh
?
That should work. You'd probably want to set the default number of map
tasks to be a multiple of the number of CPUs, and the number of reduce
tasks to be exactly the number of CPUs.
Don't use start-all.sh, but rather just:
bin/nutch-daemon.sh start tasktracker
bin/nutch-daemon.sh start jobtracker
Must I use NDFS for MapReduce?
No.
Doug
"
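For reference, here is a rough sketch of how the settings from the exchange quoted above might look as a config file for a 4-CPU box. It assumes the overrides go in conf/hadoop-site.xml and that the property names are the ones used by the Hadoop version bundled with Nutch 0.8 (both are assumptions; check conf/hadoop-default.xml in your tree). The values are illustrative only: 8 map tasks as a multiple of 4 CPUs, and 4 reduce tasks to match the CPU count.

```xml
<?xml version="1.0"?>
<!-- Sketch only: assumed to live in conf/hadoop-site.xml; values are
     illustrative for a single machine with 4 CPUs. -->
<configuration>
  <!-- Run a real jobtracker on this host instead of the in-process "local" runner. -->
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
  <!-- Default number of map tasks: a multiple of the CPU count. -->
  <property>
    <name>mapred.map.tasks</name>
    <value>8</value>
  </property>
  <!-- Default number of reduce tasks: exactly the CPU count. -->
  <property>
    <name>mapred.reduce.tasks</name>
    <value>4</value>
  </property>
  <!-- Maximum number of tasks the single tasktracker runs at once. -->
  <property>
    <name>mapred.tasktracker.tasks.maximum</name>
    <value>4</value>
  </property>
</configuration>
```

With something like this in place, the two nutch-daemon.sh commands from the quoted reply would start the tasktracker and jobtracker on the same machine. No NDFS setting is included, since the reply says it is not required.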
Doug Cook wrote:
Hi,
I've recently switched to 0.8 from 0.7, and after some initial fits and
starts, I'm past the "get it working at all" stage to the "get reasonable
performance" stage.
I've got a single machine with 4 CPUs and a lot of memory. URL fetching
works great because it's (mostly) multithreaded. But as soon as I hit the
reduce phase of fetch, it's dog slow. I'm down to running on one CPU, and
the phase can take days, leaving me vulnerable to losing everything should a
process fail.
Wait! you say. That's just what Hadoop is for! I'm all ears. I'd love some
help getting my configuration right. I've seen examples/tutorials of
configurations for multiple machines; am I just "faking" multiple machines
on my single node (will that work?) or is there a cleaner, simpler approach?
Alternatively, I was all excited to get an easy improvement with
-numFetchers, and run 4 fetchers simultaneously to use all my CPUs, but it
looks like -numFetchers has gone away, and though there was a 0.8 version
patch, at a quick glance this didn't seem to have made it into the mainline
source, and I don't see the value of trying to merge this in if there's a
cleaner Hadoop-based approach.
Many thanks for any help.
Doug