Doug Cutting wrote:
What version of Nutch are you using?

The version of NDFS in the mapred branch is much improved. The crawling code in that branch has also been re-written to be MapReduce-based, and will automatically manage multi-machine fetching, db updates, indexing, etc.

There's not yet much documentation for this version, however. Probably the best documentation is in this PDF, and it is spartan:

http://wiki.apache.org/nutch-data/attachments/Presentations/attachments/oscon05.pdf

Here's a quick cheat sheet:

svn co https://svn.apache.org/repos/asf/lucene/nutch/branches/mapred
cd mapred
ant

emacs conf/nutch-site.xml
# define fs.default.name to be masterHost:XXXX
# define mapred.job.tracker to be masterHost:YYYY

emacs conf/mapred-default.xml
# define mapred.map.tasks to be a multiple of # of slave hosts
# define mapred.reduce.tasks to be # of slave hosts

# make a file with slave host names
echo slave1 >> ~/.slaves
echo slave2 >> ~/.slaves
echo slave3 >> ~/.slaves

# start all ndfs & mapred daemons
bin/start-all.sh

# make a directory with seed list file
mkdir seeds
echo http://lucene.apache.org/nutch/ > seeds/urls

# put seed directory in ndfs
bin/nutch ndfs -put seeds seeds

# crawl a bit
bin/nutch crawl seeds -depth 3

# monitor things from administrative interface
firefox masterHost:7845
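
For anyone who would rather script the two config edits than do them by hand in emacs, here is a rough sketch. It is not part of the branch. The function names, the `<nutch-conf>` root element, and the maps-per-slave multiplier are assumptions for illustration; check the element names against the nutch-default.xml in your own checkout before relying on it.

```shell
#!/bin/sh
# Sketch only. The <nutch-conf> root element and the helper names below are
# assumptions, not something shipped with the mapred branch.

# write_nutch_site CONF_DIR FS_NAME JOB_TRACKER
# Writes CONF_DIR/nutch-site.xml with the two properties from the cheat sheet.
write_nutch_site() {
  conf_dir=$1
  fs_name=$2
  job_tracker=$3
  mkdir -p "$conf_dir"
  cat > "$conf_dir/nutch-site.xml" <<EOF
<?xml version="1.0"?>
<nutch-conf>
  <property>
    <name>fs.default.name</name>
    <value>$fs_name</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>$job_tracker</value>
  </property>
</nutch-conf>
EOF
}

# write_mapred_default CONF_DIR SLAVES_FILE MAPS_PER_SLAVE
# Counts the non-empty lines in SLAVES_FILE and writes
# CONF_DIR/mapred-default.xml with:
#   mapred.reduce.tasks = number of slave hosts
#   mapred.map.tasks    = number of slave hosts * MAPS_PER_SLAVE
write_mapred_default() {
  conf_dir=$1
  slaves_file=$2
  maps_per_slave=$3
  slaves=$(grep -c . "$slaves_file")
  map_tasks=$((slaves * maps_per_slave))
  mkdir -p "$conf_dir"
  cat > "$conf_dir/mapred-default.xml" <<EOF
<?xml version="1.0"?>
<nutch-conf>
  <property>
    <name>mapred.map.tasks</name>
    <value>$map_tasks</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>$slaves</value>
  </property>
</nutch-conf>
EOF
}
```

After sourcing the above, something like `write_nutch_site conf masterHost:XXXX masterHost:YYYY` followed by `write_mapred_default conf ~/.slaves 4` should produce both files, with XXXX and YYYY replaced by whatever ports you picked. The multiplier 4 is arbitrary; any positive integer keeps mapred.map.tasks a multiple of the slave count.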

If you try this, please tell us how it goes.

Doug



Hi,

This cheat sheet worked perfectly!!! First time!!!

And all I can say is wow. Looks great.

Gal.
