I will add in your changes and then put it up on the wiki.

Dennis
-----Original Message-----
From: Doug Cutting [mailto:[EMAIL PROTECTED]]
Sent: Monday, March 20, 2006 2:41 PM
To: [email protected]
Subject: Re: Nutch and Hadoop Tutorial Finished

Dennis Kubes wrote:
> Here it is for the list, I will try to put it on the wiki as well.

Thanks for writing this! I've added a few comments below.

> Some things are assumed for this tutorial. First, you will need root-level
> access to all of the boxes you are deploying to.

Root access should not be required (although it is sometimes convenient). I have certainly run large-scale crawls w/o root.

> The only way to get Nutch 0.8 Dev as of this writing that I know of is
> through Subversion.

Nightly builds of Hadoop's trunk (currently 0.8-dev) are available from:

http://cvs.apache.org/dist/lucene/hadoop/nightly/

> Add a build.properties file and inside of it add a variable called dist.dir
> with its value as the location where you want to build nutch. So if you are
> building on a linux machine it would look something like this:
>
> dist.dir=/path/to/build

This is optional.

> So log into the master node and all of the slave nodes as root. Create the
> nutch user and the different filesystems with the following commands:
>
> mkdir /nutch
> mkdir /nutch/search
> mkdir /nutch/filesystem
> mkdir /nutch/home
>
> useradd -d /nutch/home -g users nutch
> chown -R nutch:users /nutch
> passwd nutch        # enter nutchuserpassword when prompted

You can of course run things as any user. I always run things as myself, but that may not be appropriate in all environments.

> First we are going to edit the ssh daemon. The line that reads
> #PermitUserEnvironment no should be changed to yes and the daemon restarted.
> This will need to be done on all nodes.
>
> vi /etc/ssh/sshd_config
> PermitUserEnvironment yes

This is not required (although it can be useful). If you see errors from ssh when running scripts, then try changing the value of HADOOP_SSH_OPTS in conf/hadoop-env.sh (a rough hadoop-env.sh sketch appears near the end of this message).

> Once we have the ssh daemon configured, the ssh keys created and copied to
> all of the nodes, we will need to create an environment file for ssh to use.
> When nutch logs in to the slave nodes using ssh, the environment file
> creates the environment variables for the shell. The environment file is
> created under the nutch home .ssh directory. We will create the environment
> file on the master node and copy it to all of the slave nodes.
>
> vi /nutch/home/.ssh/environment
>
> .. environment variables
>
> Then copy it to all of the slave nodes using scp:
>
> scp /nutch/home/.ssh/environment [EMAIL PROTECTED]:/nutch/home/.ssh/environment

One can now instead put environment variables in conf/hadoop-env.sh, since not all versions of ssh support PermitUserEnvironment.

> cd /nutch/search
> scp -r /nutch/search/* [EMAIL PROTECTED]:/nutch/search

Note that, after the initial copy, you can set NUTCH_MASTER in your conf/hadoop-env.sh and it will use rsync to update the code running on each slave when you start daemons on that slave.

> The first time all of the nodes are started there may be the ssh dialog
> asking to add the hosts to the known_hosts file. You will have to type in
> yes for each one and hit enter. The output may be a little weird the first
> time, but just keep typing yes and hitting enter if the dialogs keep
> appearing.

A command like 'bin/slaves.sh uptime' is a good way to test that things are configured correctly before attempting bin/start-all.sh.
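For concreteness, here is a rough sketch of the key setup and the sanity check described above. The key type, paths, and host names (slave1, etc.) are examples only; adjust them for your environment.

   # On the master, as the user that will run the daemons, generate a
   # password-less key pair (or protect the key and use ssh-agent):
   ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
   cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

   # Append the public key to authorized_keys on each slave so the
   # start/stop scripts can ssh in without a password:
   scp ~/.ssh/id_rsa.pub nutch@slave1:/tmp/master_key.pub
   ssh nutch@slave1 'cat /tmp/master_key.pub >> /nutch/home/.ssh/authorized_keys'

   # From /nutch/search on the master, check every host listed in the
   # slaves file before starting the daemons:
   bin/slaves.sh uptime
   bin/start-all.sh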
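And to tie the conf/hadoop-env.sh comments together, a minimal sketch of the relevant lines. The values shown are examples/assumptions, not requirements, and the exact form NUTCH_MASTER takes is worth confirming against bin/nutch-daemon.sh in your checkout.

   # conf/hadoop-env.sh -- sourced by the bin/*.sh scripts on every node.
   # Example values only; adjust paths and host names for your installation.
   export JAVA_HOME=/usr/java/jdk1.5          # instead of .ssh/environment
   export HADOOP_SSH_OPTS="-o BatchMode=yes"  # tweak these if ssh complains
   export NUTCH_MASTER=master:/nutch/search   # slaves rsync the code from here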
Thanks again for providing this!

Doug
