MAPREDUCE-6995. Uploader tool for Distributed Cache Deploy documentation 
([email protected] via rkanter)


Project: http://git-wip-us.apache.org/repos/asf/hadoop/repo
Commit: http://git-wip-us.apache.org/repos/asf/hadoop/commit/836643d7
Tree: http://git-wip-us.apache.org/repos/asf/hadoop/tree/836643d7
Diff: http://git-wip-us.apache.org/repos/asf/hadoop/diff/836643d7

Branch: refs/heads/HDFS-7240
Commit: 836643d793c68bf1bee883abece84f024591da7c
Parents: 62c9e7f
Author: Robert Kanter <[email protected]>
Authored: Fri Jan 19 17:55:24 2018 -0800
Committer: Robert Kanter <[email protected]>
Committed: Fri Jan 19 17:57:54 2018 -0800

----------------------------------------------------------------------
 .../site/markdown/DistributedCacheDeploy.md.vm  | 61 ++++++++++++++---
 .../src/site/markdown/MapredCommands.md         | 19 ++++++
 .../mapred/uploader/FrameworkUploader.java      | 48 +++++++++----
 .../mapred/uploader/TestFrameworkUploader.java  | 72 ++++++++++++++++++++
 4 files changed, 178 insertions(+), 22 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/hadoop/blob/836643d7/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/markdown/DistributedCacheDeploy.md.vm
----------------------------------------------------------------------
diff --git a/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/markdown/DistributedCacheDeploy.md.vm b/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/markdown/DistributedCacheDeploy.md.vm
index c69be1c..4552235 100644
--- a/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/markdown/DistributedCacheDeploy.md.vm
+++ b/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/markdown/DistributedCacheDeploy.md.vm
@@ -12,10 +12,6 @@
   limitations under the License. See accompanying LICENSE file.
 -->
 
-#set ( $H3 = '###' )
-#set ( $H4 = '####' )
-#set ( $H5 = '#####' )
-
 Hadoop: Distributed Cache Deploy
 ================================
 
@@ -55,23 +51,41 @@ Deploying a new MapReduce version consists of three steps:
 1.  Upload the MapReduce archive to a location that can be accessed by the
     job submission client. Ideally the archive should be on the cluster's default
     filesystem at a publicly-readable path. See the archive location discussion
-    below for more details.
+    below for more details. You can use the framework uploader tool to perform
+    this step, for example:
+    `mapred frameworkuploader -target
+    hdfs:///mapred/framework/hadoop-mapreduce-${project.version}.tar#mrframework`.
+    It selects the jar files that are on the classpath and puts them into
+    a tar archive at the location specified by the `-target` and `-fs` options.
+    The tool then prints a suggestion of how to set
+    `mapreduce.application.framework.path` and
+    `mapreduce.application.classpath` (see the invocation sketch after this
+    list).
+
+    `-fs`: The target file system. Defaults to the default filesystem set by
+    `fs.defaultFS`.
+
+    `-target` is the target location of the framework tarball, optionally
+     followed by a `#` with the localized alias. The tool uploads the tarball
+     to the specified location. gzip compression is not needed, since the jar
+     files inside the archive are already compressed. Make sure the target
+     directory is readable by all users but writable only by administrators,
+     to protect cluster security.
 
 2.  Configure `mapreduce.application.framework.path` to point to the
     location where the archive is located. As when specifying distributed cache
     files for a job, this is a URL that also supports creating an alias for the
     archive if a URL fragment is specified. For example,
-    `hdfs:/mapred/framework/hadoop-mapreduce-${project.version}.tar.gz#mrframework`
+    `hdfs:///mapred/framework/hadoop-mapreduce-${project.version}.tar.gz#mrframework`
     will be localized as `mrframework` rather than
     `hadoop-mapreduce-${project.version}.tar.gz`.
 
 3.  Configure `mapreduce.application.classpath` to set the proper
-    classpath to use with the MapReduce archive configured above. NOTE: An error
+    classpath to use with the MapReduce archive configured above.
+    If the `frameworkuploader` tool is used, it uploads all dependencies
+    and returns the value that needs to be configured here. NOTE: An error
     occurs if `mapreduce.application.framework.path` is configured but
     `mapreduce.application.classpath` does not reference the base name of the
     archive path or the alias if an alias was specified.
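
As a concrete sketch of step 1 above (the namenode address and target path are
illustrative assumptions, not defaults of the tool):

    # upload the framework tarball to a publicly readable HDFS location
    mapred frameworkuploader \
        -fs hdfs://mynamenode:8020 \
        -target /mapred/framework/hadoop-mapreduce.tar.gz#mrframework

The `#mrframework` fragment sets the localized alias referenced later in
`mapreduce.application.classpath`.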
 
-$H3 Location of the MapReduce Archive and How It Affects Job Performance
+### Location of the MapReduce Archive and How It Affects Job Performance
 
 Note that the location of the MapReduce archive can be critical to job 
submission and job startup performance. If the archive is not located on the 
cluster's default filesystem then it will be copied to the job staging 
directory for each job and localized to each node where the job's tasks run. 
This will slow down job submission and task startup performance.
 
@@ -79,6 +93,16 @@ If the archive is located on the default filesystem then the job client will not
 
 When working with a large cluster it can be important to increase the 
replication factor of the archive to increase its availability. This will 
spread the load when the nodes in the cluster localize the archive for the 
first time.
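
If the archive was uploaded manually rather than with the `frameworkuploader`
tool, one way to raise its replication factor afterwards is the standard HDFS
shell command (the path here is illustrative):

    hdfs dfs -setrep -w 10 /mapred/framework/hadoop-mapreduce.tar.gz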
 
+The `frameworkuploader` tool mentioned above has additional parameters that 
help to adjust performance:
+
+`-initialReplication`: This is the replication count that the framework 
tarball is created with. It is safe to leave this value at the default 3. This 
is the tested scenario.
+
+`-finalReplication`: The uploader tool sets the replication once all blocks are collected and uploaded. If quick initial startup is required, then it is advised to set this to the commissioned node count divided by two, but not more than 512. This will leverage HDFS to spread the tarball in a distributed manner. Once the jobs start, they will likely hit a local HDFS node to localize from, or they can select from a wide set of additional source nodes. If this is set to a low value like 10, then the output bandwidth of those replicated nodes will affect how fast the first job runs. The replication count can be manually reduced to a low value like 10 once all the jobs have started in the cluster, to save disk space.
+
+`-acceptableReplication`: The tool will wait until the tarball has been replicated this number of times before exiting. This should be a replication count less than or equal to the value in `finalReplication`. It is typically set to 90% of the value of `finalReplication` to accommodate failing nodes.
+
+`-timeout`: A timeout in seconds to wait for `acceptableReplication` to be reached before the tool exits. If the timeout is reached, the tool logs an error and returns.
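
For illustration, on a cluster with roughly 200 commissioned nodes, a tuned
upload following the guidance above might look like this (the node count and
derived values are assumptions for the example):

    mapred frameworkuploader \
        -target /mapred/framework/hadoop-mapreduce.tar.gz#mrframework \
        -initialReplication 3 \
        -finalReplication 100 \
        -acceptableReplication 90 \
        -timeout 600

Here `-finalReplication` is the node count divided by two, and
`-acceptableReplication` is 90% of that value.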
+
 MapReduce Archives and Classpath Configuration
 ----------------------------------------------
 
@@ -90,7 +114,17 @@ Another possible approach is to have the archive consist of just the MapReduce j
 
 
`$HADOOP_CONF_DIR,$PWD/hadoop-mapreduce-${project.version}.tar.gz/hadoop-mapreduce-${project.version}/share/hadoop/mapreduce/*,$PWD/hadoop-mapreduce-${project.version}.tar.gz/hadoop-mapreduce-${project.version}/share/hadoop/mapreduce/lib/*,$HADOOP_COMMON_HOME/share/hadoop/common/*,$HADOOP_COMMON_HOME/share/hadoop/common/lib/*,$HADOOP_HDFS_HOME/share/hadoop/hdfs/*,$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*,$HADOOP_YARN_HOME/share/hadoop/yarn/*,$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*`
 
-$H3 NOTE:
+The `frameworkuploader` tool has the following arguments to control which jars 
end up in the framework tarball:
+
+`-input`: This is the input classpath that is iterated through. Jar files found will be added to the tarball. It defaults to the classpath as returned by the `hadoop classpath` command.
+
+`-blacklist`: This is a comma-separated array of regexes used to filter the jar file names to exclude from the class path. It can be used, for example, to exclude test jars or Hadoop services that are not necessary to localize.
+
+`-whitelist`: This is a comma-separated array of regexes used to include only certain jar files. This can be used to provide additional security, so that no external source can include malicious code in the classpath when the tool runs.
+
+`-nosymlink`: This flag can be used to exclude symlinks that point into the same directory. This is not widely used. For example, `/a/foo.jar` and a symlink `/a/bar.jar` that points to `/a/foo.jar` would normally add `foo.jar` and `bar.jar` to the tarball as separate files even though they are actually the same file. This flag makes the tool exclude `/a/bar.jar`, so only one copy of the file is added.
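
As an illustrative sketch (the regex patterns below are assumptions for the
example, not shipped defaults):

    # exclude test and sources jars, admit only Hadoop artifacts
    mapred frameworkuploader \
        -target /mapred/framework/hadoop-mapreduce.tar.gz#mrframework \
        -blacklist '.*-tests\.jar,.*-sources\.jar' \
        -whitelist 'hadoop-.*\.jar' \
        -nosymlink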
+
+### NOTE:
 
If shuffle encryption is also enabled in the cluster, then we may hit a problem where MR jobs fail with an exception like the one below:
 
@@ -119,3 +153,12 @@ If shuffle encryption is also enabled in the cluster, then we could meet the pro
     ....
 
This is because the MR client (deployed from HDFS) cannot access ssl-client.xml on the local filesystem under the $HADOOP\_CONF\_DIR directory. To fix the problem, we can add the directory containing ssl-client.xml to the classpath of MR, which is specified in "mapreduce.application.classpath" as mentioned above. To avoid the MR application being affected by other local configurations, it is better to create a dedicated directory for ssl-client.xml, e.g. a sub-directory under $HADOOP\_CONF\_DIR, like $HADOOP\_CONF\_DIR/security.
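
A minimal sketch of that workaround (the directory name is just the suggested
convention above):

    # keep ssl-client.xml in a dedicated config directory
    mkdir -p $HADOOP_CONF_DIR/security
    cp $HADOOP_CONF_DIR/ssl-client.xml $HADOOP_CONF_DIR/security/
    # then append $HADOOP_CONF_DIR/security to mapreduce.application.classpath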
+
+The framework upload tool can be used to collect cluster jars that the MapReduce
+AM, mappers and reducers will use.
+It logs messages that provide the suggested configuration values:
+
+    INFO uploader.FrameworkUploader: Uploaded hdfs://mynamenode/mapred/framework/mr-framework.tar#mr-framework
+    INFO uploader.FrameworkUploader: Suggested mapreduce.application.classpath $PWD/mr-framework/*
+
+Set `mapreduce.application.framework.path` to the first and `mapreduce.application.classpath` to the second logged value above, respectively.
\ No newline at end of file
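
For completeness, a sketch of acting on those two logged values (the paths
shown are taken from the example log lines above):

    # run the uploader and capture the suggestions it logs
    mapred frameworkuploader \
        -target /mapred/framework/mr-framework.tar#mr-framework 2>&1 | grep -E 'Uploaded|Suggested'
    # then, in mapred-site.xml, set:
    #   mapreduce.application.framework.path = hdfs://mynamenode/mapred/framework/mr-framework.tar#mr-framework
    #   mapreduce.application.classpath      = $PWD/mr-framework/*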

http://git-wip-us.apache.org/repos/asf/hadoop/blob/836643d7/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/markdown/MapredCommands.md
----------------------------------------------------------------------
diff --git a/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/markdown/MapredCommands.md b/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/markdown/MapredCommands.md
index df07024..176fcac 100644
--- a/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/markdown/MapredCommands.md
+++ b/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/markdown/MapredCommands.md
@@ -162,4 +162,23 @@ Usage: `mapred hsadmin [-refreshUserToGroupsMappings] | [-refreshSuperUserGroups
 | -getGroups [username] | Get the groups which given user belongs to |
 | -help [cmd] | Displays help for the given command or all commands if none is 
specified. |
 
+### `frameworkuploader`
+
+Collects framework jars and uploads them to HDFS as a tarball.
+
+Usage: `mapred frameworkuploader -target <target> [-fs <filesystem>] [-input <classpath>] [-blacklist <list>] [-whitelist <list>] [-initialReplication <num>] [-acceptableReplication <num>] [-finalReplication <num>] [-timeout <seconds>] [-nosymlink]`
+
+| COMMAND\_OPTION | Description |
+|:---- |:---- |
+| -input *classpath* | This is the input classpath that is searched for jar files to be included in the tarball. |
+| -fs *filesystem* | The target file system. Defaults to the default filesystem set by fs.defaultFS. |
+| -target *target* | This is the target location of the framework tarball, optionally followed by a # with the localized alias. An example would be /usr/lib/framework.tar#framework. Make sure the target directory is readable by all users but writable only by administrators, to protect cluster security. |
+| -blacklist *list* | This is a comma-separated array of regexes used to filter the jar file names to exclude from the class path. It can be used, for example, to exclude test jars or Hadoop services that are not necessary to localize. |
+| -whitelist *list* | This is a comma-separated array of regexes used to include only certain jar files. This can be used to provide additional security, so that no external source can include malicious code in the classpath when the tool runs. |
+| -nosymlink | This flag can be used to exclude symlinks that point into the same directory. This is not widely used. For example, `/a/foo.jar` and a symlink `/a/bar.jar` that points to `/a/foo.jar` would normally add `foo.jar` and `bar.jar` to the tarball as separate files even though they are actually the same file. This flag makes the tool exclude `/a/bar.jar`, so only one copy of the file is added. |
+| -initialReplication *num* | This is the replication count that the framework tarball is created with. It is safe to leave this value at the default 3. This is the tested scenario. |
+| -finalReplication *num* | The uploader tool sets the replication once all blocks are collected and uploaded. If quick initial startup is required, then it is advised to set this to the commissioned node count divided by two, but not more than 512. |
+| -acceptableReplication *num* | The tool will wait until the tarball has been replicated this number of times before exiting. This should be a replication count less than or equal to the value in `finalReplication`. It is typically set to 90% of the value of `finalReplication` to accommodate failing nodes. |
+| -timeout *seconds* | A timeout in seconds to wait for `acceptableReplication` to be reached before the tool exits. If the timeout is reached, the tool logs an error and returns. |
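
For example, a minimal invocation (the target path is illustrative) would be:

    mapred frameworkuploader -target hdfs:///mapred/framework/framework.tar#framework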
+
 

http://git-wip-us.apache.org/repos/asf/hadoop/blob/836643d7/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-uploader/src/main/java/org/apache/hadoop/mapred/uploader/FrameworkUploader.java
----------------------------------------------------------------------
diff --git 
a/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-uploader/src/main/java/org/apache/hadoop/mapred/uploader/FrameworkUploader.java
 
b/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-uploader/src/main/java/org/apache/hadoop/mapred/uploader/FrameworkUploader.java
index ee482d7..5316f38 100644
--- 
a/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-uploader/src/main/java/org/apache/hadoop/mapred/uploader/FrameworkUploader.java
+++ 
b/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-uploader/src/main/java/org/apache/hadoop/mapred/uploader/FrameworkUploader.java
@@ -56,6 +56,8 @@ import java.util.regex.Matcher;
 import java.util.regex.Pattern;
 import java.util.zip.GZIPOutputStream;
 
+import static org.apache.hadoop.fs.FileSystem.FS_DEFAULT_NAME_KEY;
+
 /**
  * Upload a MapReduce framework tarball to HDFS.
  * Usage:
@@ -67,6 +69,7 @@ public class FrameworkUploader implements Runnable {
       Pattern.compile(Shell.getEnvironmentVariableRegex());
   private static final Logger LOG =
       LoggerFactory.getLogger(FrameworkUploader.class);
+  private Configuration conf = new Configuration();
 
   @VisibleForTesting
   String input = null;
@@ -98,6 +101,11 @@ public class FrameworkUploader implements Runnable {
   private OutputStream targetStream = null;
   private String alias = null;
 
+  @VisibleForTesting
+  void setConf(Configuration configuration) {
+    conf = configuration;
+  }
+
   private void printHelp(Options options) {
     HelpFormatter formatter = new HelpFormatter();
     formatter.printHelp("mapred frameworkuploader", options);
@@ -168,7 +176,7 @@ public class FrameworkUploader implements Runnable {
           target.substring(lastIndex + 1) :
           targetPath.getName();
       LOG.info("Target " + targetPath);
-      FileSystem fileSystem = targetPath.getFileSystem(new Configuration());
+      FileSystem fileSystem = targetPath.getFileSystem(conf);
 
       targetStream = null;
       if (fileSystem instanceof DistributedFileSystem) {
@@ -205,7 +213,7 @@ public class FrameworkUploader implements Runnable {
 
   private long getSmallestReplicatedBlockCount()
       throws IOException {
-    FileSystem fileSystem = targetPath.getFileSystem(new Configuration());
+    FileSystem fileSystem = targetPath.getFileSystem(conf);
     FileStatus status = fileSystem.getFileStatus(targetPath);
     long length = status.getLen();
     HashMap<Long, Integer> blockCount = new HashMap<>();
@@ -221,7 +229,8 @@ public class FrameworkUploader implements Runnable {
     for(BlockLocation location: locations) {
       final int replicas = location.getHosts().length;
       blockCount.compute(
-          location.getOffset(), (key, value) -> value + replicas);
+          location.getOffset(),
+          (key, value) -> value == null ? 0 : value + replicas);
     }
 
     // Print out the results
@@ -236,7 +245,7 @@ public class FrameworkUploader implements Runnable {
 
   private void endUpload()
       throws IOException, InterruptedException {
-    FileSystem fileSystem = targetPath.getFileSystem(new Configuration());
+    FileSystem fileSystem = targetPath.getFileSystem(conf);
     if (fileSystem instanceof DistributedFileSystem) {
       fileSystem.setReplication(targetPath, finalReplication);
       LOG.info("Set replication to " +
@@ -428,7 +437,7 @@ public class FrameworkUploader implements Runnable {
     opts.addOption(OptionBuilder.create("h"));
     opts.addOption(OptionBuilder.create("help"));
     opts.addOption(OptionBuilder
-        .withDescription("Input class path")
+        .withDescription("Input class path. Defaults to the default classpath.")
         .hasArg().create("input"));
     opts.addOption(OptionBuilder
         .withDescription(
@@ -502,20 +511,33 @@ public class FrameworkUploader implements Runnable {
     }
     String fs = parser.getCommandLine()
         .getOptionValue("fs", null);
+    String path = parser.getCommandLine().getOptionValue("target",
+        "/usr/lib/mr-framework.tar.gz#mr-framework");
+    boolean isFullPath =
+        path.startsWith("hdfs://") ||
+        path.startsWith("file://");
+
     if (fs == null) {
-      LOG.error("Target file system not specified");
-      printHelp(opts);
-      return false;
+      fs = conf.get(FS_DEFAULT_NAME_KEY);
+      if (fs == null && !isFullPath) {
+        LOG.error("No filesystem specified in either fs or target.");
+        printHelp(opts);
+        return false;
+      } else {
+        LOG.info(String.format(
+            "Target file system not specified. Using default %s", fs));
+      }
     }
-    String path = parser.getCommandLine().getOptionValue("target",
-        "mr-framework.tar.gz#mr-framework");
-    if (path == null) {
+    if (path.isEmpty()) {
       LOG.error("Target directory not specified");
       printHelp(opts);
       return false;
     }
-    StringBuilder absolutePath = new StringBuilder(fs);
-    absolutePath = absolutePath.append(path.startsWith("/") ? "" : "/");
+    StringBuilder absolutePath = new StringBuilder();
+    if (!isFullPath) {
+      absolutePath.append(fs);
+      absolutePath.append(path.startsWith("/") ? "" : "/");
+    }
     absolutePath.append(path);
     target = absolutePath.toString();
 

http://git-wip-us.apache.org/repos/asf/hadoop/blob/836643d7/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-uploader/src/test/java/org/apache/hadoop/mapred/uploader/TestFrameworkUploader.java
----------------------------------------------------------------------
diff --git 
a/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-uploader/src/test/java/org/apache/hadoop/mapred/uploader/TestFrameworkUploader.java
 
b/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-uploader/src/test/java/org/apache/hadoop/mapred/uploader/TestFrameworkUploader.java
index 61c0b12..c12902c 100644
--- 
a/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-uploader/src/test/java/org/apache/hadoop/mapred/uploader/TestFrameworkUploader.java
+++ 
b/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-uploader/src/test/java/org/apache/hadoop/mapred/uploader/TestFrameworkUploader.java
@@ -23,6 +23,7 @@ import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
 import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
 import org.apache.commons.io.FileUtils;
 import org.apache.commons.io.IOUtils;
+import org.apache.hadoop.conf.Configuration;
 import org.junit.Assert;
 import org.junit.Assume;
 import org.junit.Before;
@@ -44,6 +45,8 @@ import java.util.Random;
 import java.util.Set;
 import java.util.zip.GZIPInputStream;
 
+import static org.apache.hadoop.fs.FileSystem.FS_DEFAULT_NAME_KEY;
+
 /**
  * Unit test class for FrameworkUploader.
  */
@@ -130,6 +133,75 @@ public class TestFrameworkUploader {
   }
 
   /**
+   * Test the behavior when no filesystem is specified.
+   */
+  @Test
+  public void testNoFilesystem() throws IOException {
+    FrameworkUploader uploader = new FrameworkUploader();
+    boolean success = uploader.parseArguments(new String[]{});
+    Assert.assertTrue("Expected to parse arguments", success);
+    Assert.assertEquals(
+        "Expected",
+        "file:////usr/lib/mr-framework.tar.gz#mr-framework", uploader.target);
+  }
+
+  /**
+   * Test picking up the default filesystem from the configuration.
+   */
+  @Test
+  public void testDefaultFilesystem() throws IOException {
+    FrameworkUploader uploader = new FrameworkUploader();
+    Configuration conf = new Configuration();
+    conf.set(FS_DEFAULT_NAME_KEY, "hdfs://namenode:555");
+    uploader.setConf(conf);
+    boolean success = uploader.parseArguments(new String[]{});
+    Assert.assertTrue("Expected to parse arguments", success);
+    Assert.assertEquals(
+        "Expected",
+        "hdfs://namenode:555/usr/lib/mr-framework.tar.gz#mr-framework",
+        uploader.target);
+  }
+
+  /**
+   * Test the explicit filesystem specification.
+   */
+  @Test
+  public void testExplicitFilesystem() throws IOException {
+    FrameworkUploader uploader = new FrameworkUploader();
+    Configuration conf = new Configuration();
+    uploader.setConf(conf);
+    boolean success = uploader.parseArguments(new String[]{
+        "-target",
+        "hdfs://namenode:555/usr/lib/mr-framework.tar.gz#mr-framework"
+    });
+    Assert.assertTrue("Expected to parse arguments", success);
+    Assert.assertEquals(
+        "Expected",
+        "hdfs://namenode:555/usr/lib/mr-framework.tar.gz#mr-framework",
+        uploader.target);
+  }
+
+  /**
+   * Test the conflicting filesystem specification.
+   */
+  @Test
+  public void testConflictingFilesystem() throws IOException {
+    FrameworkUploader uploader = new FrameworkUploader();
+    Configuration conf = new Configuration();
+    conf.set(FS_DEFAULT_NAME_KEY, "hdfs://namenode:555");
+    uploader.setConf(conf);
+    boolean success = uploader.parseArguments(new String[]{
+        "-target",
+        "file:///usr/lib/mr-framework.tar.gz#mr-framework"
+    });
+    Assert.assertTrue("Expected to parse arguments", success);
+    Assert.assertEquals(
+        "Expected",
+        "file:///usr/lib/mr-framework.tar.gz#mr-framework",
+        uploader.target);
+  }
+
+  /**
    * Test whether we can filter a class path properly.
    * @throws IOException test failure
    */

