[Hadoop Wiki] Update of "UnixShellScriptProgrammingGuide" by SomeOtherAccount

2016-05-31 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change 
notification.

The "UnixShellScriptProgrammingGuide" page has been changed by SomeOtherAccount:
https://wiki.apache.org/hadoop/UnixShellScriptProgrammingGuide?action=diff&rev1=20&rev2=21

Comment:
More dynamic subcommands updates

  ## page was renamed from ShellScriptProgrammingGuide
  = Introduction =
- 
  With [[https://issues.apache.org/jira/browse/HADOOP-9902|HADOOP-9902]], the 
shell script code base has been refactored, with common functions and utilities 
put into a shell library (hadoop-functions.sh).  Here are some tips and tricks 
to get the most out of using this functionality:
  
  = The Skeleton =
- 
  All properly built shell scripts contain the following sections (a sketch combining them follows this list):
  
   1. `hadoop_usage` function that contains an alphabetized list of subcommands and their descriptions.  This is used when the user directly asks for help, when there is a command-line syntax error, etc.
  
-  2. `HADOOP_LIBEXEC_DIR` configured.  This should be the location of where 
`hadoop-functions.sh`, `hadoop-config.sh`, etc, are located.
+  1. `HADOOP_LIBEXEC_DIR` configured.  This should point to the directory where `hadoop-functions.sh`, `hadoop-config.sh`, etc., are located.
  
-  3. `HADOOP_NEW_CONFIG=true`.  This tells the rest of the system that the 
code being executed is aware that it is using the new shell API and it will 
call the routines it needs to call on its own.  If this isn't set, then several 
default actions that were done in Hadoop 2.x and earlier are executed and 
several key parts of the functionality are lost.
+  1. `HADOOP_NEW_CONFIG=true`.  This tells the rest of the system that the code being executed is aware that it is using the new shell API and that it will call the routines it needs on its own.  If this isn't set, then several default actions that were done in Hadoop 2.x and earlier are executed, and several key parts of the functionality are lost.
  
-  4. `$HADOOP_LIBEXEC_DIR/abc-config.sh` is executed, where abc is the 
subproject.  HDFS scripts should call `hdfs-config.sh`. MAPRED scripts should 
call `mapred-config.sh` YARN scripts should call `yarn-config.sh`.  Everything 
else should call `hadoop-config.sh`. This does a lot of standard 
initialization, processes standard options, etc. This is also what provides 
override capabilities for subproject specific environment variables. For 
example, the system will normally ignore `yarn-env.sh`, but `yarn-config.sh` 
will activate those settings.
+  1. `$HADOOP_LIBEXEC_DIR/abc-config.sh` is executed, where abc is the subproject.  HDFS scripts should call `hdfs-config.sh`, MapReduce scripts should call `mapred-config.sh`, and YARN scripts should call `yarn-config.sh`.  Everything else should call `hadoop-config.sh`.  This does a lot of standard initialization, processes standard options, etc.  This is also what provides override capabilities for subproject-specific environment variables.  For example, the system will normally ignore `yarn-env.sh`, but `yarn-config.sh` will activate those settings.
  
-  5. At this point, this is where the majority of your code goes.  Programs 
should process the rest of the arguments and doing whatever their script is 
supposed to do.
+  1. This is where the majority of your code goes.  Programs should process the rest of the arguments and do whatever their script is supposed to do.
  
-  6. Before executing a Java program (preferably via hadoop_java_exec) or 
giving user output, call `hadoop_finalize`.  This finishes up the configuration 
details: adds the user class path, fixes up any missing Java properties, 
configures library paths, etc.  
+  1. Before executing a Java program (preferably via `hadoop_java_exec`) or giving user output, call `hadoop_finalize`.  This finishes up the configuration details: adds the user classpath, fixes up any missing Java properties, configures library paths, etc.
  
-  7. Either an `exit` or an `exec`.  This should return 0 for success and 1 or 
higher for failure.
+  1. Either an `exit` or an `exec`.  This should return 0 for success and 1 or 
higher for failure.
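
  Putting those sections together, here is a minimal, hedged sketch (the `foo` command name and the Java class are made up for illustration; only the `hadoop_*` functions and variables come from the shell library):

{{{
#!/usr/bin/env bash

# 1. usage text, with subcommands kept alphabetized
function hadoop_usage
{
  echo "Usage: foo [--daemon (start|stop|status)] <subcommand>"
}

# 2. where hadoop-functions.sh, hadoop-config.sh, etc., live
if [[ -z "${HADOOP_LIBEXEC_DIR}" ]]; then
  HADOOP_LIBEXEC_DIR=$(cd -P -- "$(dirname -- "${BASH_SOURCE-$0}")/../libexec" >/dev/null && pwd -P)
fi

# 3. declare that we speak the new shell API
HADOOP_NEW_CONFIG=true

# 4. standard initialization; use hdfs/mapred/yarn-config.sh as appropriate
. "${HADOOP_LIBEXEC_DIR}/hadoop-config.sh"

# 5. the majority of your code: process arguments, pick the class to run
HADOOP_CLASSNAME=org.example.Foo   # hypothetical class

# 6. finish up classpath, Java properties, library paths
hadoop_finalize

# 7. exec the JVM; exits 0 on success, 1 or higher on failure
hadoop_java_exec foo "${HADOOP_CLASSNAME}" "$@"
}}}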
  
- = Adding a Subcommand to an Existing Script =
+ = Adding a Subcommand to an Existing Script (NOT hadoop-tools-based) =
- 
  In order to add a new subcommand, two things need to be done:
  
   1. Add a line to that script's `hadoop_usage` function that lists the name 
of the subcommand and what it does.  This should be alphabetized.
  
-  2. Add an additional entry in the case conditional. Depending upon what is 
being added, several things may need to be done:
+  1. Add an additional entry in the case conditional. Depending upon what is 
being added, several things may need to be done:
+   a. Set `HADOOP_CLASSNAME` to the Java class to execute.
+   b. Add $HADOOP_CLIENT_OPTS to $HADOOP_OPTS (or, for YARN apps, $YARN_CLIENT_OPTS to $YARN_OPTS) if this is an interactive application or should, for some other reason, have the user client settings applied.  (A sketch of such a case entry follows.)
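
  As a hedged sketch (the `frobnicate` subcommand and its class name are hypothetical), such a case entry might look like:

{{{
case ${subcmd} in
  frobnicate)
    # class that implements the subcommand
    HADOOP_CLASSNAME=org.apache.hadoop.example.Frobnicate
    # interactive client, so honor the user's client settings
    HADOOP_OPTS="${HADOOP_OPTS} ${HADOOP_CLIENT_OPTS}"
  ;;
esac
}}}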

[Hadoop Wiki] Update of UnixShellScriptProgrammingGuide by SomeOtherAccount

2015-03-30 Thread Apache Wiki

The UnixShellScriptProgrammingGuide page has been changed by SomeOtherAccount:
https://wiki.apache.org/hadoop/UnixShellScriptProgrammingGuide?action=diff&rev1=18&rev2=19

  
  In addition to all of the variables documented in `*-env.sh` and 
`hadoop-layout.sh`, there are a handful of special env vars:
  
- * `JAVA_HEAP_MAX` - This is the Xmx parameter to be passed to Java. (e.g., 
`-Xmx1g`).  This is present for backward compatibility, however it should be 
added to `HADOOP_OPTS` via `hadoop_add_param HADOOP_OPTS Xmx 
${JAVA_HEAP_MAX}` prior to calling `hadoop_finalize`.
+ * `HADOOP_HEAP_MAX` - This is the Xmx parameter to be passed to Java (e.g., `-Xmx1g`).  This is present for backward compatibility; however, it should be added to `HADOOP_OPTS` via `hadoop_add_param HADOOP_OPTS Xmx ${HADOOP_HEAP_MAX}` prior to calling `hadoop_finalize` (see the sketch below).
  
  * `HADOOP_DAEMON_MODE` - This will be set to `start` or `stop` based upon 
what `hadoop-config.sh` has determined from the command line options.
  


[Hadoop Wiki] Update of UnixShellScriptProgrammingGuide by SomeOtherAccount

2015-03-12 Thread Apache Wiki

The UnixShellScriptProgrammingGuide page has been changed by SomeOtherAccount:
https://wiki.apache.org/hadoop/UnixShellScriptProgrammingGuide?action=diff&rev1=17&rev2=18

  
   * Avoid adding more globals or project-specific globals, and/or entries in *-env.sh, and/or a comment at the bottom here.  In a lot of cases, there is pre-existing functionality that already does what you might need to do.  Additionally, every configuration option makes things that much harder for end users. If you do need to add a new global variable for additional functionality, start it with HADOOP_ for common, HDFS_ for HDFS, YARN_ for YARN, and MAPRED_ for MapReduce.  It should be documented in either *-env.sh (for user-overridable parts) or hadoop-functions.sh (for internal-only globals). This helps prevent our variables from clobbering other people's.
  
-  * Remember that abc_xyz_OPTS can and should act as a catch-all for Java 
daemon options.  Custom heap environment variables add unnecessary complexity 
for both the user and us.  They should be avoided.
+  * Remember that abc_xyz_OPTS can and should act as a catch-all for Java daemon options.  Custom heap environment variables and other custom daemon variables add unnecessary complexity for both the user and us.  They should be avoided.  In almost every case, it is better to have a global and apply it to all daemons to get a universal default.  Users can/will override those variables as necessary in their init scripts.  This also helps cover the case when functionality starts in one chunk of Hadoop but ends up in multiple places.
  
   * Avoid multi-level `if`s where the comparisons are static strings.  Use case statements instead, as they are easier to read.
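
  A tiny illustration of the difference, with placeholder actions:

{{{
# instead of chained string comparisons:
if [[ "${1}" == "start" ]]; then
  echo "starting"
elif [[ "${1}" == "stop" ]]; then
  echo "stopping"
fi

# prefer a case statement:
case ${1} in
  start)
    echo "starting"
  ;;
  stop)
    echo "stopping"
  ;;
esac
}}}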
  


[Hadoop Wiki] Update of UnixShellScriptProgrammingGuide by SomeOtherAccount

2014-12-01 Thread Apache Wiki

The UnixShellScriptProgrammingGuide page has been changed by SomeOtherAccount:
https://wiki.apache.org/hadoop/UnixShellScriptProgrammingGuide?action=diff&rev1=16&rev2=17

  
  1-... unless `HADOOP_IDENT_STRING` is modified appropriately. This means that 
post-HADOOP-9902, it is now possible to run two secure datanodes on the same 
machine as the same user, since all of the logs, pids, and outs, take that into 
consideration! QA folks should be very happy.
  
+ = A New Subproject or Subproject-like Structure =
+ 
+ The following files should be the basis of the new bits:
+ 
+ * libexec/(project)-config.sh
+ 
+ This contains the new, common configuration/startup bits. At a minimum, it should contain the bootstrap stanza for hadoop-config.sh and a function called hadoop_subproject_init that does the actual, extra work that needs to be done.  Variables should be HADOOP_(project)_(whatever) and should be initialized based on the standard HADOOP_* variables.  (A sketch appears at the end of this section.)
+ 
+ * bin/(project) or sbin/(project)
+ 
+ User-facing commands should be in bin; administrator commands should be in sbin.  See the skeleton example up above for how to build these types of scripts.
+ 
+ * conf/*-env.sh
+ 
+ Ideally, this should follow the pattern already established by the other *-env.sh files.  Entries in it can also be placed in hadoop-env.sh to provide a consistent and much easier operational experience.
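
  A hedged sketch of such a libexec/(project)-config.sh for a hypothetical subproject called myproj (the bootstrap stanza mirrors the ones in the existing hdfs/yarn/mapred config scripts; hadoop-config.sh will invoke hadoop_subproject_init if it is defined):

{{{
# libexec/myproj-config.sh -- "myproj" is a made-up subproject name

function hadoop_subproject_init
{
  # pick up the subproject's env file, if present
  if [[ -e "${HADOOP_CONF_DIR}/myproj-env.sh" ]]; then
    . "${HADOOP_CONF_DIR}/myproj-env.sh"
  fi

  # initialize subproject variables off of the standard HADOOP_* ones
  HADOOP_MYPROJ_HOME="${HADOOP_MYPROJ_HOME:-${HADOOP_HOME}}"
}

# bootstrap stanza: locate and source hadoop-config.sh
if [[ -n "${HADOOP_COMMON_HOME}" ]] &&
   [[ -e "${HADOOP_COMMON_HOME}/libexec/hadoop-config.sh" ]]; then
  . "${HADOOP_COMMON_HOME}/libexec/hadoop-config.sh"
elif [[ -e "${HADOOP_LIBEXEC_DIR}/hadoop-config.sh" ]]; then
  . "${HADOOP_LIBEXEC_DIR}/hadoop-config.sh"
else
  echo "ERROR: Hadoop common not found." >&2
  exit 1
fi
}}}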
+ 


[Hadoop Wiki] Update of UnixShellScriptProgrammingGuide by SomeOtherAccount

2014-11-19 Thread Apache Wiki

The UnixShellScriptProgrammingGuide page has been changed by SomeOtherAccount:
https://wiki.apache.org/hadoop/UnixShellScriptProgrammingGuide?action=diff&rev1=14&rev2=15

Comment:
Update docs post-HADOOP-11208 which renamed the daemon variable

  
b. Add $HADOOP_CLIENT_OPTS to $HADOOP_OPTS (or, for YARN apps, $YARN_CLIENT_OPTS to $YARN_OPTS) if this is an interactive application or should, for some other reason, have the user client settings applied.
  
-   c. For methods that can also be daemons, set `daemon=true`.  This will 
allow for the `--daemon` option to work.
+   c. For methods that can also be daemons, set `supportdaemonization=true`.  
This will allow for the `--daemon` option to work. See more below.
  
d. If it supports security, set `secure_service=true` and `secure_user` 
equal to the user that should run the daemon.
  
@@ -82, +82 @@

  
  * `HADOOP_SLAVE_NAMES` - This is the list of hosts the user passed via 
`--hostnames`.
  
+ 
+ = Daemonization =
+ 
+ n branch-2 and previous, daemons were handled via wrapping standard 
command lines. If we concentrate on the functionality (vs. the code rot...) 
this has some interesting (and inconsistent) results, especially around logging 
and pid files. If you run the *-daemon version, you got a pid file and 
hadoop.root.logger is set to be INFO,(something). When a daemon is run in 
non-daemon mode (e.g., straight up: 'hdfs namenode'), no pid file is generated 
and hadoop.root.logger is kept as INFO,console. With no pid file generated, it 
is possible to run, e.g. hdfs namenode, both in *-daemon.sh mode and in 
straight up mode again. It also means that one needs to pull apart the process 
list to safely determine the status of the daemon since pid files aren't always 
created. This made building custom init scripts fraught with danger. This 
inconsistency has been a point of frustration for many operations teams.
+ 
+ Post-HADOOP-9902, there is a slight change in the above functionality and one 
of the key reasons why this is an incompatible change. Sub-commands that were 
intended to run as daemons (either fully, e.g., namenode or partially, e.g. 
balancer) have all of this handling consolidated, helping to eliminate code rot 
as well as providing a consistent user experience across projects. daemon=true, 
which is a per-script local, but is consistent across the hadoop sub-projects, 
tells the latter parts of the shell code that this sub-command needs to have 
some extra-handling enabled beyond the normal commands. In particular, 
supportdaemonization=true's will always get pid and out files. They will 
prevent two being run on the same machine by the same user simultaneously (see 
footnote 1, however). They get some extra options on the java command line. 
Etc, etc.
+ 
+ So where does --daemon come in? The value of that is stored in a global 
called HADOOP_DAEMON_MODE. If the user doesn't set it specifically, it defaults 
to 'default'. This was done to allow the code to mostly replicate the behavior 
of branch-2 and previous when the *-daemon.sh code was NOT used. In other 
words, --daemon default (or no value provided), let's commands like hdfs 
namenode still run in the foreground, just now with pid and out files. --daemon 
start does the disown (previously a nohup), change the logging output from 
HADOOP_ROOT_LOGGER to HADOOP_DAEMON_ROOT_LOGGER, add some extra command line 
options, etc, etc similar to the *-daemon.sh commands.
+ 
+ What happens if daemon mode is set for all commands? The big thing is the pid 
and out file creation and the checks around it. A user would only ever be able 
to execute one 'hadoop fs' command at a time because of the pid file! Less than 
ideal.
+ 
+ To summarize, supportdaemonization=true tells the code that --daemon actually 
means something to the sub-command. Otherwise, --daemon is ignored.
+ 
+ 1-... unless HADOOP_IDENT_STRING is modified appropriately. This means that 
post-HADOOP-9902, it is now possible to run two secure datanodes on the same 
machine as the same user, since all of the logs, pids, and outs, take that into 
consideration! QA folks should be very happy.
+ 
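
  As a hedged illustration (the subcommand name, class, and secure user variable are all made up), a daemon-capable case entry would look something like:

{{{
case ${subcmd} in
  mydaemon)
    # tell the shell code that --daemon means something for this sub-command
    supportdaemonization="true"
    HADOOP_CLASSNAME=org.apache.hadoop.example.MyDaemon   # hypothetical class
    # if it supports security:
    secure_service="true"
    secure_user="${MYDAEMON_SECURE_USER}"                 # hypothetical variable
  ;;
esac
}}}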


[Hadoop Wiki] Update of UnixShellScriptProgrammingGuide by SomeOtherAccount

2014-11-19 Thread Apache Wiki

The UnixShellScriptProgrammingGuide page has been changed by SomeOtherAccount:
https://wiki.apache.org/hadoop/UnixShellScriptProgrammingGuide?action=diff&rev1=15&rev2=16

Comment:
Minor cleanup of previous big edit.

  * `HADOOP_SLAVE_NAMES` - This is the list of hosts the user passed via 
`--hostnames`.
  
  
- = Daemonization =
+ = About Daemonization =
  
- n branch-2 and previous, daemons were handled via wrapping standard 
command lines. If we concentrate on the functionality (vs. the code rot...) 
this has some interesting (and inconsistent) results, especially around logging 
and pid files. If you run the *-daemon version, you got a pid file and 
hadoop.root.logger is set to be INFO,(something). When a daemon is run in 
non-daemon mode (e.g., straight up: 'hdfs namenode'), no pid file is generated 
and hadoop.root.logger is kept as INFO,console. With no pid file generated, it 
is possible to run, e.g. hdfs namenode, both in *-daemon.sh mode and in 
straight up mode again. It also means that one needs to pull apart the process 
list to safely determine the status of the daemon since pid files aren't always 
created. This made building custom init scripts fraught with danger. This 
inconsistency has been a point of frustration for many operations teams.
+ In branch-2 and previous, daemons were handled via wrapping standard command lines. If we concentrate on the functionality (vs. the code rot...) this has some interesting (and inconsistent) results, especially around logging and pid files. If you run the `*-daemon` version, you get a pid file and `hadoop.root.logger` is set to `INFO,(something)`. When a daemon is run in non-daemon mode (e.g., straight up: `hdfs namenode`), no pid file is generated and `hadoop.root.logger` is kept as `INFO,console`. With no pid file generated, it is possible to run, e.g., `hdfs namenode` both in `*-daemon.sh` mode and in straight-up mode at the same time. It also means that one needs to pull apart the process list to safely determine the status of the daemon, since pid files aren't always created. This made building custom init scripts fraught with danger. This inconsistency has been a point of frustration for many operations teams.
  
- Post-HADOOP-9902, there is a slight change in the above functionality and one 
of the key reasons why this is an incompatible change. Sub-commands that were 
intended to run as daemons (either fully, e.g., namenode or partially, e.g. 
balancer) have all of this handling consolidated, helping to eliminate code rot 
as well as providing a consistent user experience across projects. daemon=true, 
which is a per-script local, but is consistent across the hadoop sub-projects, 
tells the latter parts of the shell code that this sub-command needs to have 
some extra-handling enabled beyond the normal commands. In particular, 
supportdaemonization=true's will always get pid and out files. They will 
prevent two being run on the same machine by the same user simultaneously (see 
footnote 1, however). They get some extra options on the java command line. 
Etc, etc.
+ Post-HADOOP-9902, there is a slight change in the above functionality, and it is one of the key reasons why this is an incompatible change. Sub-commands that were intended to run as daemons (either fully, e.g., namenode, or partially, e.g., balancer) have all of this handling consolidated, helping to eliminate code rot as well as providing a consistent user experience across projects. `supportdaemonization=true`, which is a per-script local but is consistent across the hadoop sub-projects, tells the latter parts of the shell code that this sub-command needs some extra handling enabled beyond the normal commands. In particular, `supportdaemonization=true` sub-commands will always get pid and out files. They will prevent two being run on the same machine by the same user simultaneously (see footnote 1, however). They get some extra options on the java command line. Etc, etc.
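
  For example, with the post-HADOOP-9902 daemon handling (`namenode` is a real daemon sub-command):

{{{
# runs in the foreground, but now with pid and out files
hdfs namenode

# daemonizes: disown, HADOOP_DAEMON_ROOT_LOGGER, extra JVM options, etc.
hdfs --daemon start namenode
hdfs --daemon status namenode
hdfs --daemon stop namenode
}}}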
  
- So where does --daemon come in? The value of that is stored in a global 
called HADOOP_DAEMON_MODE. If the user doesn't set it specifically, it defaults 
to 'default'. This was done to allow the code to mostly replicate the behavior 
of branch-2 and previous when the *-daemon.sh code was NOT used. In other 
words, --daemon default (or no value provided), let's commands like hdfs 
namenode still run in the foreground, just now with pid and out files. --daemon 
start does the disown (previously a nohup), change the logging output from 
HADOOP_ROOT_LOGGER to HADOOP_DAEMON_ROOT_LOGGER, add some extra command line 
options, etc, etc similar to the *-daemon.sh commands.
+ So where does `--daemon` come in? The value of that is stored in a global 
called `HADOOP_DAEMON_MODE`. If the user doesn't set it specifically, it 
defaults to 'default'. This was done to allow the code to mostly replicate the 
behavior of