Doug Cutting wrote:
What version of Nutch are you using?
The version of NDFS in the mapred branch is much improved. The
crawling code in that branch has also been re-written to be
MapReduce-based, and will automatically manage multi-machine fetching,
db updates, indexing, etc.
There's not yet much documentation for this version, however. Probably
the best documentation is in this pdf, and it is spartan:
http://wiki.apache.org/nutch-data/attachments/Presentations/attachments/oscon05.pdf
Here's a quick cheat sheet:
svn co https://svn.apache.org/repos/asf/lucene/nutch/branches/mapred
cd mapred
ant
emacs conf/nutch-site.xml
# define fs.default.name to be masterHost:XXXX
# define mapred.job.tracker to be masterHost:YYYY
emacs conf/mapred-default.xml
# define mapred.map.tasks to be a multiple of # of slave hosts
# define mapred.reduce.tasks to be # of slave hosts
# make a file with slave host names
echo slave1 >> ~/.slaves
echo slave2 >> ~/.slaves
echo slave3 >> ~/.slaves
# start all ndfs & mapred daemons
bin/start-all.sh
# make a directory with seed list file
mkdir seeds
echo http://lucene.apache.org/nutch/ > seeds/urls
# put seed directory in ndfs
bin/nutch ndfs -put seeds seeds
# crawl a bit
bin/nutch crawl seeds -depth 3
# monitor things from administrative interface
firefox masterHost:7845
If you try this, please tell us how it goes.
Doug
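
The cheat sheet leaves the four config settings as comments, so here is a rough sketch of what the entries might look like. It assumes the property/name/value layout used by the other Nutch conf files. nutch_property is a made-up helper, not part of Nutch. XXXX and YYYY are the placeholder ports from the cheat sheet, and the factor of 2 for map tasks is just one arbitrary multiple of the three slave hosts.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nutch): print one <property> element
# in the name/value layout assumed for conf/nutch-site.xml and
# conf/mapred-default.xml.
nutch_property() {
  printf '<property>\n  <name>%s</name>\n  <value>%s</value>\n</property>\n' "$1" "$2"
}

# Entries for conf/nutch-site.xml. XXXX and YYYY are the cheat sheet's
# placeholders; substitute the ports you actually use.
nutch_property fs.default.name masterHost:XXXX
nutch_property mapred.job.tracker masterHost:YYYY

# Entries for conf/mapred-default.xml, assuming the three slave hosts
# listed in ~/.slaves. The multiplier 2 is an assumption, not a recommendation.
slaves=3
nutch_property mapred.map.tasks "$((slaves * 2))"
nutch_property mapred.reduce.tasks "$slaves"
```

The script only prints to stdout. The printed elements would be pasted inside the root element of the respective file by hand.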
.
Hi,
This cheat sheet worked perfectly, first time!
And all I can say is wow. Looks great.
Gal.