ups, sorry...
Paul, you may should mentioned that this scripts require ssh in a
version higher than 3.8.
A great page!
Stefan
Am 11.11.2005 um 13:45 schrieb Stefan Groschupf:
Am 11.11.2005 um 11:48 schrieb Apache Wiki:
Dear Wiki user,
You have subscribed to a wiki page or wiki category on "Nutch
Wiki" for change notification.
The following page has been changed by PaulBaclace:
http://wiki.apache.org/nutch/OverviewDeploymentConfigs
New page:
== Overview of Deployment Configurations in Nutch 0.8 ==
(11/2005 Paul Baclace)
This page describes a range of deployment configurations, the
assumptions involved, and the relevant property settings. The
primary focus is on a few canonical deployments scenarios and
surrounding issues. Relevant properties are described, but a
complete description of all properties is not attempted here.
The process startup sequence is also described in order to see
differences between different deployments.
Flexibility of assumptions is noted with MUST (rigid) or SHOULD
(highly recommended, but could be different for the adventurous).
=== Configuration File Overview ===
When building Nutch, the conf directory has 2 important property
files that are put into the classpath for lookup at runtime:
* ''nutch-default.xml'' the place for universal defaults as set
by the Nutch developers.
* ''nutch-site.xml'' the highest priority properties that
override all other.
The Java System Properties are ''not'' consulted for Nutch
properties, so -D style commandline overriding is strongly
discouraged. However, System Properties are used when standard
properties are to be found.
The bin/nutch sh script places $NUTCH_HOME/conf at the beginning
of the classpath so that the xml property files can be found.
=== Nutch Shell Scripts ===
A meta-assumption here is that the sh scripts in the nutch bin
directory are used to start and control the ensemble of processes
across many machines.
The Nutch shell scripts are simple and elegant and they form a
call hierarchy, starting at the top level:
1. start_all.sh or stop_all.sh - start and stop whole ensemble.
2. nutch_daemons.sh - run a Nutch command on all slave hosts.
3. slaves.sh - run a shell command on all slave hosts.
4. nutch_daemon.sh - run a Nutch command as a daemon with a start|
stop argument like a regular Unix/Linux /etc/rc.local script; the
process pid is stored during start and used during stop. Runs
rsync at start.
5. nutch - run a Nutch command using the JVM.
Depending upon the context of use, any level of these scripts can
be handy on the command line.
=== Configuration Assumptions ===
For simplicity of configuration, filenames you pass to commands
SHOULD be pathnames that work on all hosts. When working with just
a few hosts, this seems to be a limitation, but it obviously makes
a lot of sense when hundreds or thousands of machines are involved.
1. property settings are meant to be the same across hosts; they
are SHOULD not be customized per host (they are not even settable
on the commandline, so per-process settings are discouraged).
2. filenames and paths are meant to be the same across hosts
(SHOULD).
3. Some file paths are ambivalent about NDFS/local filesystem and
are interpreted depending on which kind of filesystem is in use.
4. each machine SHOULD have (including the master) nutch
installed in the same filesystem path.
5. The ndfs.data.dir and mapred.local.dir properties list comma
separated directories. Only those that exist are used. So not
all machines are required to have exactly the same devices.
=== System Assumptions ===
1. The env var NUTCH_MASTER is set to the hostname of the master
machine.
2. The slave nodes are defined by putting list of hostnames, one
per line, in ~/.slaves (alternatively, use NUTCH_SLAVES to refer
to a different file).
3. a cluster of machines is managed from a master machine,
without a firewall in bewteen any of the machines (MUST, for
simplicity). Many tcp/ip ports are used.
4. the master machine MUST have a no-password login (ssh) to all
the slave machines, using the same username.
5. set environment variables in ~/.ssh/environment, since ssh
does not source your .bash_profile. These include JAVA_HOME,
NUTCH_LOG_DIR, NUTCH_SLAVES and NUTCH_MASTER.
6. make sure that your NUTCH_LOG_DIR and the directories named
in ndfs.data.dir exist on all slaves. This can be done most
easily with bin/slaves.sh.
=== Deployment Startup Sequences ===
A. Cluster deployment with too many machines to customize
(probably more than 4; 1000 machines should be possible):
6. bin/slaves.sh rsync-command is used as needed to update jars
and conf files from master.
7. the ensemble starts by running bin/start-all.sh on the master.
8. start-all.sh uses bin/nutch-daemons.sh run one datanode
process on each slave (in the background without waiting, one
daemon thread is started per comma-separated storage device, non-
existent storage devices in the list are ignored).
9. start-all.sh runs one namenode and one jobtracker on the master.
10. start-all.sh uses bin/nutch-daemons.sh run one tasktracker
process on each slave (in the background without waiting).
B. Cluster of a few machines:
1. ''Add more details here''
C. One developer debugging on one machine:
1. ''Add more details here''