I will add in your changes and then put it up on the wiki.

Dennis
-----Original Message-----
From: Doug Cutting [mailto:[EMAIL PROTECTED]]
Sent: Monday, March 20, 2006 2:41 PM
To: [email protected]
Subject: Re: Nutch and Hadoop Tutorial Finished

Dennis Kubes wrote:
> Here it is for the list, I will try to put it on the wiki as well.

Thanks for writing this! I've added a few comments below.

> Some things are assumed for this tutorial. First, you will need root-level
> access to all of the boxes you are deploying to.

Root access should not be required (although it is sometimes convenient). I have certainly run large-scale crawls w/o root.

> The only way to get Nutch 0.8 Dev as of this writing that I know of is
> through Subversion.

Nightly builds of Hadoop's trunk (currently 0.8-dev) are available from:

http://cvs.apache.org/dist/lucene/hadoop/nightly/

> Add a build.properties file and inside of it add a variable called dist.dir
> with its value as the location where you want to build nutch. So if you are
> building on a linux machine it would look something like this:
>
> dist.dir=/path/to/build

This is optional.

> So log into the master node and all of the slave nodes as root. Create the
> nutch user and the different filesystems with the following commands:
>
> mkdir /nutch
> mkdir /nutch/search
> mkdir /nutch/filesystem
> mkdir /nutch/home
>
> useradd -d /nutch/home -g users nutch
> chown -R nutch:users /nutch
> passwd nutch        # enter nutchuserpassword when prompted

You can of course run things as any user. I always run things as myself, but that may not be appropriate in all environments.

> First we are going to edit the ssh daemon. The line that reads
> #PermitUserEnvironment no should be changed to yes and the daemon restarted.
> This will need to be done on all nodes.
>
> vi /etc/ssh/sshd_config
> PermitUserEnvironment yes

This is not required (although it can be useful). If you see errors from ssh when running scripts, then try changing the value of HADOOP_SSH_OPTS in conf/hadoop-env.sh (a rough hadoop-env.sh sketch appears near the end of this message).

> Once we have the ssh daemon configured, the ssh keys created and copied to
> all of the nodes, we will need to create an environment file for ssh to use.
> When nutch logs in to the slave nodes using ssh, the environment file
> creates the environment variables for the shell. The environment file is
> created under the nutch home .ssh directory. We will create the environment
> file on the master node and copy it to all of the slave nodes.
>
> vi /nutch/home/.ssh/environment
>
> .. environment variables
>
> Then copy it to all of the slave nodes using scp:
>
> scp /nutch/home/.ssh/environment [EMAIL PROTECTED]:/nutch/home/.ssh/environment

One can now instead put environment variables in conf/hadoop-env.sh, since not all versions of ssh support PermitUserEnvironment.

> cd /nutch/search
> scp -r /nutch/search/* [EMAIL PROTECTED]:/nutch/search

Note that, after the initial copy, you can set NUTCH_MASTER in your conf/hadoop-env.sh and it will use rsync to update the code running on each slave when you start daemons on that slave.

> The first time all of the nodes are started there may be the ssh dialog
> asking to add the hosts to the known_hosts file. You will have to type in
> yes for each one and hit enter. The output may be a little weird the first
> time, but just keep typing yes and hitting enter if the dialogs keep
> appearing.

A command like 'bin/slaves.sh uptime' is a good way to test that things are configured correctly before attempting bin/start-all.sh.
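For concreteness, here is a rough sketch of the key setup and the sanity check described above. The key type, paths, and host names (slave1, etc.) are examples only; adjust them for your environment.

   # On the master, as the user that will run the daemons, generate a
   # password-less key pair (or protect the key and use ssh-agent):
   ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
   cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

   # Append the public key to authorized_keys on each slave so the
   # start/stop scripts can ssh in without a password:
   scp ~/.ssh/id_rsa.pub nutch@slave1:/tmp/master_key.pub
   ssh nutch@slave1 'cat /tmp/master_key.pub >> /nutch/home/.ssh/authorized_keys'

   # From /nutch/search on the master, check every host listed in the
   # slaves file before starting the daemons:
   bin/slaves.sh uptime
   bin/start-all.sh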
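And to tie the conf/hadoop-env.sh comments together, a minimal sketch of the relevant lines. The values shown are examples/assumptions, not requirements, and the exact form NUTCH_MASTER takes is worth confirming against bin/nutch-daemon.sh in your checkout.

   # conf/hadoop-env.sh -- sourced by the bin/*.sh scripts on every node.
   # Example values only; adjust paths and host names for your installation.
   export JAVA_HOME=/usr/java/jdk1.5          # instead of .ssh/environment
   export HADOOP_SSH_OPTS="-o BatchMode=yes"  # tweak these if ssh complains
   export NUTCH_MASTER=master:/nutch/search   # slaves rsync the code from here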
Thanks again for providing this!

Doug
