== Overview of Deployment Configurations in Nutch 0.8 ==
(11/2005 Paul Baclace)

This page describes a range of deployment configurations, the assumptions involved, and the relevant property settings.  The primary focus is on a few canonical deployment scenarios and surrounding issues.  Relevant properties are described, but a complete description of all properties is not attempted here.

The process startup sequence is also described, in order to show how the deployments differ.

Flexibility of assumptions is noted with MUST (rigid) or SHOULD (highly 
recommended, but could be different for the adventurous). 

=== Configuration File Overview ===

When building Nutch, the conf directory has two important property files that are put on the classpath for lookup at runtime:

 * ''nutch-default.xml'' - the place for universal defaults, as set by the Nutch developers.
 * ''nutch-site.xml'' - the highest priority properties, which override all others.
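
For illustration, a minimal ''nutch-site.xml'' overriding a single property might look like the sketch below (the root element and property name follow the Nutch 0.8 property file format, but verify against your own ''nutch-default.xml''; the hostname is a placeholder):
{{{
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: site-specific overrides; highest priority -->
<nutch-conf>
  <property>
    <name>fs.default.name</name>
    <!-- "local" for the local filesystem, or host:port of the NDFS namenode -->
    <value>master.example.com:9000</value>
  </property>
</nutch-conf>
}}}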

The Java System Properties are ''not'' consulted for Nutch properties, so -D style commandline overriding does not work and is strongly discouraged.  However, System Properties are still consulted when standard Java properties are looked up.

The bin/nutch shell script places $NUTCH_HOME/conf at the beginning of the classpath so that the XML property files can be found.
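
Conceptually, the classpath setup amounts to something like the following (a simplified sketch, not the literal script text; details such as plugin directories are omitted):
{{{
# simplified sketch of what bin/nutch does
CLASSPATH=$NUTCH_HOME/conf                    # conf first, so nutch-site.xml wins
CLASSPATH=$CLASSPATH:$NUTCH_HOME/build/classes
for jar in $NUTCH_HOME/lib/*.jar; do          # add every library jar
  CLASSPATH=$CLASSPATH:$jar
done
exec "$JAVA_HOME/bin/java" -classpath "$CLASSPATH" "$@"
}}}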

=== Nutch Shell Scripts ===

A meta-assumption here is that the shell scripts in the Nutch bin directory are used to start and control the ensemble of processes across many machines.

The Nutch shell scripts are simple and elegant, and they form a call hierarchy, starting at the top level:
 1. start-all.sh or stop-all.sh - start or stop the whole ensemble.
 2. nutch-daemons.sh - run a Nutch command on all slave hosts.
 3. slaves.sh - run a shell command on all slave hosts.
 4. nutch-daemon.sh - run a Nutch command as a daemon, with a start|stop argument like a regular Unix/Linux /etc/rc.local script; the process pid is stored during start and used during stop.  Runs rsync at start.
 5. nutch - run a Nutch command using the JVM.

Depending upon the context of use, any level of these scripts can be handy on 
the command line.
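
For example, each level can be invoked directly from the command line (the commands below are illustrative, not prescriptive):
{{{
bin/start-all.sh                       # 1. start the whole ensemble
bin/nutch-daemons.sh start datanode    # 2. start a datanode on every slave
bin/slaves.sh uptime                   # 3. run any shell command on every slave
bin/nutch-daemon.sh start tasktracker  # 4. start one daemon on the local host
bin/nutch ndfs -ls /                   # 5. run a single Nutch command in a JVM
}}}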

=== Configuration Assumptions ===

For simplicity of configuration, filenames you pass to commands SHOULD be 
pathnames that work on all hosts. When working with just a few hosts, this 
seems to be a limitation, but it obviously makes a lot of sense when hundreds 
or thousands of machines are involved.

 1. Property settings are meant to be the same across hosts; they SHOULD not be customized per host (they are not even settable on the commandline, so per-process settings are discouraged).
 2. Filenames and paths are meant to be the same across hosts (SHOULD).
 3. Some file paths can refer to either NDFS or the local filesystem, and are interpreted according to which kind of filesystem is in use.
 4. Each machine (including the master) SHOULD have Nutch installed at the same filesystem path.
 5. The ndfs.data.dir and mapred.local.dir properties list comma-separated directories, and only those that exist are used, so not all machines are required to have exactly the same devices (see the example after this list).

=== System Assumptions ===

  1. The env var NUTCH_MASTER is set to the hostname of the master machine.
  2. The slave nodes are defined by putting a list of hostnames, one per line, in ~/.slaves (alternatively, set NUTCH_SLAVES to refer to a different file).
  3. A cluster of machines is managed from a master machine, without a firewall between any of the machines (MUST, for simplicity).  Many tcp/ip ports are used.
  4. The master machine MUST have no-password (ssh) login to all the slave machines, using the same username.
  5. Set environment variables in ~/.ssh/environment, since ssh does not source your .bash_profile.  These include JAVA_HOME, NUTCH_LOG_DIR, NUTCH_SLAVES and NUTCH_MASTER.
  6. Make sure that your NUTCH_LOG_DIR and the directories named in ndfs.data.dir exist on all slaves.  This can be done most easily with bin/slaves.sh (see the sketch after this list).
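
Putting items 2, 5, and 6 together, a sketch of the master-side setup (hostnames and paths are placeholders):
{{{
# ~/.slaves -- one slave hostname per line:
slave1.example.com
slave2.example.com

# ~/.ssh/environment -- read at ssh login; note that sshd must have
# PermitUserEnvironment enabled for this file to take effect:
JAVA_HOME=/usr/local/java
NUTCH_MASTER=master.example.com
NUTCH_SLAVES=/home/nutch/.slaves
NUTCH_LOG_DIR=/var/log/nutch

# create the log and data directories on every slave in one shot:
bin/slaves.sh mkdir -p /var/log/nutch /disk1/ndfs/data /disk2/ndfs/data
}}}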

=== Deployment Startup Sequences ===

 A. Cluster deployment with too many machines to customize (probably more than 
4; 1000 machines should be possible):

  1. bin/slaves.sh is used as needed to run rsync, updating jars and conf files from the master.
  2. The ensemble starts by running bin/start-all.sh on the master (see the command sketch after this list).
  3. start-all.sh uses bin/nutch-daemons.sh to run one datanode process on each slave (in the background, without waiting; one daemon thread is started per comma-separated storage device, and non-existent storage devices in the list are ignored).
  4. start-all.sh runs one namenode and one jobtracker on the master.
  5. start-all.sh uses bin/nutch-daemons.sh to run one tasktracker process on each slave (in the background, without waiting).


 B. Cluster of a few machines:
  1. ''Add more details here''

 C. One developer debugging on one machine:
  1. ''Add more details here''
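
A minimal single-machine sketch (assuming fs.default.name is left at its ''local'' default, so no daemons are needed; the seed URL and directory names are hypothetical):
{{{
# crawl directly against the local filesystem; no datanode/namenode/tasktracker:
mkdir urls
echo http://lucene.apache.org/nutch/ > urls/seeds.txt
bin/nutch crawl urls -dir crawl -depth 3
}}}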
