Re: [Nutch Wiki] Update of "OverviewDeploymentConfigs" by PaulBaclace

Stefan Groschupf Fri, 11 Nov 2005 04:47:18 -0800

Oops, sorry...

Paul, you should perhaps mention that these scripts require ssh in a version higher than 3.8.

A great page!


Stefan

On 11.11.2005 at 13:45, Stefan Groschupf wrote:

On 11.11.2005 at 11:48, Apache Wiki wrote:
Dear Wiki user,
You have subscribed to a wiki page or wiki category on "NutchWiki" for change notification.
The following page has been changed by PaulBaclace:
http://wiki.apache.org/nutch/OverviewDeploymentConfigs

New page:
== Overview of Deployment Configurations in Nutch 0.8 ==
(11/2005 Paul Baclace)
This page describes a range of deployment configurations, the assumptions involved, and the relevant property settings. The primary focus is on a few canonical deployment scenarios and surrounding issues. Relevant properties are described, but a complete description of all properties is not attempted here.
The process startup sequence is also described in order to see differences between different deployments.
Flexibility of assumptions is noted with MUST (rigid) or SHOULD (highly recommended, but could be different for the adventurous).
=== Configuration File Overview ===
When building Nutch, the conf directory has 2 important property files that are put into the classpath for lookup at runtime:
 * ''nutch-default.xml'' the place for universal defaults as set by the Nutch developers.
 * ''nutch-site.xml'' the highest priority properties that override all others.
The Java System Properties are ''not'' consulted for Nutch properties, so -D style command-line overriding is strongly discouraged. However, System Properties are still used when standard properties are looked up.
The bin/nutch sh script places $NUTCH_HOME/conf at the beginning of the classpath so that the xml property files can be found.
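As an illustration of the classpath ordering described above, here is a minimal sketch of how a launcher could build a classpath with the conf directory first. It is not the actual content of bin/nutch; the function name `build_classpath` and the lib/*.jar layout are assumptions made for the example.

```shell
#!/bin/sh
# Sketch only: build a classpath with the conf directory first, so that
# nutch-default.xml and nutch-site.xml are found before anything else.
# The lib/*.jar layout is an assumption for illustration.
build_classpath() {
  nutch_home=$1
  cp="$nutch_home/conf"
  for jar in "$nutch_home"/lib/*.jar; do
    # Skip the unexpanded pattern when no jars are present.
    if [ -e "$jar" ]; then
      cp="$cp:$jar"
    fi
  done
  printf '%s\n' "$cp"
}
```

Something like `CLASSPATH=$(build_classpath "$NUTCH_HOME")` would then put conf ahead of every jar.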
=== Nutch Shell Scripts ===
A meta-assumption here is that the sh scripts in the nutch bin directory are used to start and control the ensemble of processes across many machines.
The Nutch shell scripts are simple and elegant and they form a call hierarchy, starting at the top level:
 1. start-all.sh or stop-all.sh - start and stop the whole ensemble.
 2. nutch-daemons.sh - run a Nutch command on all slave hosts.
 3. slaves.sh - run a shell command on all slave hosts.
 4. nutch-daemon.sh - run a Nutch command as a daemon with a start|stop argument like a regular Unix/Linux /etc/rc.local script; the process pid is stored during start and used during stop. Runs rsync at start.
 5. nutch - run a Nutch command using the JVM.
Depending upon the context of use, any level of these scripts can be handy on the command line.
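To make the "run a shell command on all slave hosts" level concrete, the following is a rough sketch of a slaves.sh-style loop. It is not the real script: the function name `run_on_slaves` and the `REMOTE_SHELL` override (useful for dry runs) are hypothetical, and the only things assumed from this page are a slaves file with one hostname per line and ssh as the transport.

```shell
#!/bin/sh
# Sketch of a slaves.sh-style loop: run one command on every host listed,
# one hostname per line, in a slaves file. REMOTE_SHELL is a hypothetical
# override for dry runs; it is not a documented Nutch variable.
run_on_slaves() {
  slaves_file=$1
  shift
  while IFS= read -r host; do
    # Ignore blank lines.
    [ -n "$host" ] || continue
    # Run in the background so that slow hosts do not hold up the others.
    # stdin comes from /dev/null so the remote command cannot swallow the
    # remaining lines of the slaves file.
    ${REMOTE_SHELL:-ssh} "$host" "$@" </dev/null &
  done < "$slaves_file"
  # Wait for all background commands to finish.
  wait
}
```

For a dry run, setting `REMOTE_SHELL=echo` before calling `run_on_slaves ~/.slaves uptime` prints one "host command" line per slave instead of connecting anywhere.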
=== Configuration Assumptions ===
For simplicity of configuration, filenames you pass to commands SHOULD be pathnames that work on all hosts. When working with just a few hosts, this seems to be a limitation, but it obviously makes a lot of sense when hundreds or thousands of machines are involved.
 1. Property settings are meant to be the same across hosts; they SHOULD NOT be customized per host (they are not even settable on the command line, so per-process settings are discouraged).
 2. Filenames and paths are meant to be the same across hosts (SHOULD).
 3. Some file paths are ambivalent about NDFS/local filesystem and are interpreted depending on which kind of filesystem is in use.
 4. Each machine (including the master) SHOULD have nutch installed in the same filesystem path.
 5. The ndfs.data.dir and mapred.local.dir properties list comma-separated directories. Only those that exist are used, so not all machines are required to have exactly the same devices.
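The "only those that exist are used" rule for comma-separated directory lists can be previewed from the shell. The helper below, `existing_dirs`, is a hypothetical illustration of that rule and not code taken from Nutch, which applies the rule internally; it may still be handy for checking what a given host would end up using.

```shell
#!/bin/sh
# Sketch: given a comma-separated list such as an ndfs.data.dir value,
# print only the directories that exist on this host, one per line.
# Whitespace around commas is not trimmed.
existing_dirs() {
  # Split on commas only, with globbing off so paths are taken literally.
  old_ifs=$IFS
  IFS=,
  set -f
  for dir in $1; do
    if [ -d "$dir" ]; then
      printf '%s\n' "$dir"
    fi
  done
  set +f
  IFS=$old_ifs
}
```

For example, `existing_dirs "/disk1/ndfs,/disk2/ndfs"` (placeholder paths) lists whichever of the two directories is present on the current machine.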
=== System Assumptions ===
 1. The env var NUTCH_MASTER is set to the hostname of the master machine.
 2. The slave nodes are defined by putting a list of hostnames, one per line, in ~/.slaves (alternatively, use NUTCH_SLAVES to refer to a different file).
 3. A cluster of machines is managed from a master machine, without a firewall in between any of the machines (MUST, for simplicity). Many tcp/ip ports are used.
 4. The master machine MUST have a no-password login (ssh) to all the slave machines, using the same username.
 5. Set environment variables in ~/.ssh/environment, since ssh does not source your .bash_profile. These include JAVA_HOME, NUTCH_LOG_DIR, NUTCH_SLAVES and NUTCH_MASTER.
 6. Make sure that your NUTCH_LOG_DIR and the directories named in ndfs.data.dir exist on all slaves. This can be done most easily with bin/slaves.sh.
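A small sketch of item 5 follows: it writes the four variables as NAME=value lines, which is the format sshd reads from ~/.ssh/environment. The function name `write_ssh_environment` and every value passed to it are placeholders. Note also that OpenSSH only honors this file when the server's sshd_config has PermitUserEnvironment enabled.

```shell
#!/bin/sh
# Sketch: write NAME=value lines in the format sshd reads from
# ~/.ssh/environment. All values passed in are placeholders.
write_ssh_environment() {
  target=$1        # file to write, e.g. "$HOME/.ssh/environment"
  java_home=$2
  log_dir=$3
  slaves_file=$4
  master=$5
  {
    printf 'JAVA_HOME=%s\n' "$java_home"
    printf 'NUTCH_LOG_DIR=%s\n' "$log_dir"
    printf 'NUTCH_SLAVES=%s\n' "$slaves_file"
    printf 'NUTCH_MASTER=%s\n' "$master"
  } > "$target"
}
```

For item 6, since slaves.sh runs a shell command on every slave, something along the lines of `bin/slaves.sh mkdir -p /path/to/logs` (path is a placeholder) should create a missing directory everywhere in one step.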
=== Deployment Startup Sequences ===
 A. Cluster deployment with too many machines to customize (probably more than 4; 1000 machines should be possible):
  1. bin/slaves.sh rsync-command is used as needed to update jars and conf files from master.
  2. The ensemble starts by running bin/start-all.sh on the master.
  3. start-all.sh uses bin/nutch-daemons.sh to run one datanode process on each slave (in the background without waiting; one daemon thread is started per comma-separated storage device, and non-existent storage devices in the list are ignored).
  4. start-all.sh runs one namenode and one jobtracker on the master.
  5. start-all.sh uses bin/nutch-daemons.sh to run one tasktracker process on each slave (in the background without waiting).
 B. Cluster of a few machines:
  1. ''Add more details here''

 C. One developer debugging on one machine:
  1. ''Add more details here''
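The startup order for deployment A can be summarized as a dry run that only prints commands. This is a sketch under assumptions, not the content of start-all.sh: the `start <daemon>` argument order and the use of nutch-daemon.sh for the master-side daemons are guesses based on the script descriptions on this page, and `print_startup_sequence` is a hypothetical name.

```shell
#!/bin/sh
# Dry-run sketch of the startup order described for deployment A.
# It only prints commands; the "start <daemon>" argument order is an
# assumption, so check the scripts in your own bin directory.
print_startup_sequence() {
  # Datanodes first, one per slave.
  echo "bin/nutch-daemons.sh start datanode"
  # Then the two master-side daemons.
  echo "bin/nutch-daemon.sh start namenode"
  echo "bin/nutch-daemon.sh start jobtracker"
  # Finally the tasktrackers, one per slave.
  echo "bin/nutch-daemons.sh start tasktracker"
}
```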
