Dennis Kubes wrote:
Here it is for the list; I will try to put it on the wiki as well.
Thanks for writing this! I've added a few comments below.
Some things are assumed for this tutorial. First, you will need root-level access to all of the boxes you are deploying to.
Root access should not be required (although it is sometimes convenient). I have certainly run large-scale crawls w/o root.
As of this writing, the only way I know of to get Nutch 0.8 Dev is through Subversion.
Nightly builds of Hadoop's trunk (currently 0.8-dev) are available from: http://cvs.apache.org/dist/lucene/hadoop/nightly/
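If you do want to track the trunk directly, a Subversion checkout looks roughly like this (the repository path and relying on the default ant target are my best recollection, so check the Nutch site and build.xml if things have moved):

  svn checkout http://svn.apache.org/repos/asf/lucene/nutch/trunk nutch
  cd nutch
  ant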
Add a build.properties file and inside it define a variable called dist.dir whose value is the location where you want to build nutch. So if you are building on a linux machine it would look something like this:

  dist.dir=/path/to/build
This is optional.
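A minimal sketch of this optional step, assuming the usual ant "package" target (check build.xml for the exact target name) and a hypothetical checkout path:

  cd /path/to/your/nutch/checkout
  echo "dist.dir=/path/to/build" > build.properties
  ant package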
So log into the master node and all of the slave nodes as root. Create the nutch user and the different filesystems with the following commands:

  mkdir /nutch
  mkdir /nutch/search
  mkdir /nutch/filesystem
  mkdir /nutch/home
  useradd -d /nutch/home -g users nutch
  chown -R nutch:users /nutch
  passwd nutch
  (enter nutchuserpassword at the prompt)
You can of course run things as any user. I always run things as myself, but that may not be appropriate in all environments.
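For instance, without root you could keep everything under your own home directory and adjust the tutorial's paths accordingly (the layout below is only illustrative):

  mkdir -p ~/nutch/search ~/nutch/filesystem ~/nutch/home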
First we are going to edit the ssh daemon's configuration. The line that reads #PermitUserEnvironment no should be changed to yes and the daemon restarted. This will need to be done on all nodes.

  vi /etc/ssh/sshd_config

and set:

  PermitUserEnvironment yes
This is not required (although it can be useful). If you see errors from ssh when running scripts, then try changing the value of HADOOP_SSH_OPTS in conf/hadoop-env.sh.
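For example, something along these lines in conf/hadoop-env.sh; the particular options are only a suggestion, so use whatever your ssh client supports:

  export HADOOP_SSH_OPTS="-o ConnectTimeout=1 -o BatchMode=yes"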
Once we have the ssh daemon configured and the ssh keys created and copied to all of the nodes, we will need to create an environment file for ssh to use. When nutch logs in to the slave nodes using ssh, the environment file sets the environment variables for the shell. The environment file is created under the nutch home .ssh directory. We will create the environment file on the master node and copy it to all of the slave nodes.

  vi /nutch/home/.ssh/environment
  .. environment variables

Then copy it to all of the slave nodes using scp:

  scp /nutch/home/.ssh/environment [EMAIL PROTECTED]:/nutch/home/.ssh/environment
One can now instead put environment variables in conf/hadoop-env.sh, since not all versions of ssh support PermitUserEnvironment.
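For example, in conf/hadoop-env.sh (the values below are placeholders; adjust them for your own machines):

  export JAVA_HOME=/usr/java/jdk1.5.0       # wherever your JDK lives
  export HADOOP_HEAPSIZE=1000               # heap size in MB for each daemon
  export HADOOP_LOG_DIR=/nutch/search/logs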
  cd /nutch/search
  scp -r /nutch/search/* [EMAIL PROTECTED]:/nutch/search
Note that, after the initial copy, you can set NUTCH_MASTER in your conf/hadoop-env.sh and it will use rsync to update the code running on each slave when you start daemons on that slave.
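For example, in conf/hadoop-env.sh on each slave (I am assuming the host:path form here, and "masterhost" is just a placeholder; double-check against bin/nutch-daemon.sh on your version):

  export NUTCH_MASTER=masterhost:/nutch/search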
The first time all of the nodes are started there may be an ssh dialog asking to add the hosts to the known_hosts file. You will have to type yes for each one and hit enter. The output may be a little weird the first time, but just keep typing yes and hitting enter if the dialogs keep appearing.
A command like 'bin/slaves.sh uptime' is a good way to test that things are configured correctly before attempting bin/start-all.sh.
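For example, from the master as the nutch user (assuming conf/slaves already lists your slave hostnames):

  cd /nutch/search
  bin/slaves.sh uptime

Each slave listed in conf/slaves should print its uptime; if one hangs or asks for a password, fix ssh to that host before trying bin/start-all.sh.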
Thanks again for providing this!

Doug
