Dennis Kubes wrote:
Here it is for the list; I will try to put it on the wiki as well.

Thanks for writing this!

I've added a few comments below.

Some things are assumed for this tutorial.  First, you will need root-level
access to all of the boxes you are deploying to.

Root access should not be required (although it is sometimes convenient). I have certainly run large-scale crawls w/o root.

As of this writing, the only way I know of to get Nutch 0.8-dev is through
Subversion.

Nightly builds of Hadoop's trunk (currently 0.8-dev) are available from:

http://cvs.apache.org/dist/lucene/hadoop/nightly/
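
For reference, a Subversion checkout at the time would have looked roughly
like the following (the repository path is from memory and may have moved
since):

svn checkout http://svn.apache.org/repos/asf/lucene/nutch/trunk nutch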

Add a build.properties file and inside it add a variable called dist.dir
with its value set to the location where you want to build Nutch. If you
are building on a Linux machine it would look something like this:

dist.dir=/path/to/build

This is optional.
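
If you do set it, a build from the top of the checkout is then roughly as
follows (the package target name is from memory; a plain 'ant' also works
for a simple compile):

cd /path/to/nutch
ant package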

So log into the master node and all of the slave nodes as root. Create the
nutch user and the directory layout with the following commands:

mkdir /nutch
mkdir /nutch/search
mkdir /nutch/filesystem
mkdir /nutch/home

useradd -d /nutch/home -g users nutch
chown -R nutch:users /nutch
passwd nutch

(passwd will prompt you to enter the nutch user's password.)

You can of course run things as any user. I always run things as myself, but that may not be appropriate in all environments.
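
Whichever user you choose, the later steps assume that user can ssh to
every slave node without a password.  A minimal sketch of setting that up
with standard OpenSSH commands (the key type and paths are just one
reasonable choice):

su - nutch
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
# then copy the ~/.ssh directory to each slave, e.g. with scp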

First we are going to edit the ssh daemon configuration.  The line that
reads #PermitUserEnvironment no should be changed to PermitUserEnvironment
yes and the daemon restarted.  This will need to be done on all nodes.

vi /etc/ssh/sshd_config
PermitUserEnvironment yes

This is not required (although it can be useful). If you see errors from ssh when running scripts, then try changing the value of HADOOP_SSH_OPTS in conf/hadoop-env.sh.
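
For example, conf/hadoop-env.sh might carry something like the following
(the option values here are only illustrative):

export HADOOP_SSH_OPTS="-o ConnectTimeout=1 -o SendEnv=HADOOP_CONF_DIR"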

Once we have the ssh daemon configured and the ssh keys created and copied
to all of the nodes, we will need to create an environment file for ssh to
use.  When nutch logs in to the slave nodes using ssh, the environment file
sets the environment variables for the shell.  The environment file is
created under the nutch home .ssh directory.  We will create the environment
file on the master node and copy it to all of the slave nodes.

vi /nutch/home/.ssh/environment

.. environment variables

Then copy it to all of the slave nodes using scp:

scp /nutch/home/.ssh/environment [EMAIL PROTECTED]:/nutch/home/.ssh/environment

One can now instead put environment variables in conf/hadoop-env.sh, since not all versions of ssh support PermitUserEnvironment.
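
A sketch of the sort of variables that can go in conf/hadoop-env.sh instead
(the paths and values are examples only):

export JAVA_HOME=/usr/java/jdk1.5
export HADOOP_LOG_DIR=/nutch/search/logs
export HADOOP_HEAPSIZE=1000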

cd /nutch/search
scp -r /nutch/search/* [EMAIL PROTECTED]:/nutch/search

Note that, after the initial copy, you can set NUTCH_MASTER in your conf/hadoop-env.sh and it will use rsync to update the code running on each slave when you start daemons on that slave.
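
A sketch of what that could look like in conf/hadoop-env.sh (the host name
is made up; the host:path form mirrors HADOOP_MASTER, so check
bin/nutch-daemon.sh for the exact form your version expects):

export NUTCH_MASTER=master01:/nutch/search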

The first time all of the nodes are started, ssh may prompt you to add each
host to the known_hosts file.  You will have to type yes and hit enter for
each one.  The output may look a little weird the first time, but just keep
typing yes and hitting enter as the prompts keep appearing.

A command like 'bin/slaves.sh uptime' is a good way to test that things are configured correctly before attempting bin/start-all.sh.
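
As a concrete sequence, run as the nutch user from the deploy directory
(the host list comes from conf/slaves):

cd /nutch/search
bin/slaves.sh uptime
bin/start-all.sh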

Thanks again for providing this!

Doug

