[3/6] hadoop git commit: HDFS-10678. Documenting NNThroughputBenchmark tool. (Contributed by Mingliang Liu)

liuml07 Mon, 15 Aug 2016 20:58:28 -0700

HDFS-10678. Documenting NNThroughputBenchmark tool. (Contributed by Mingliang Liu)



Project: http://git-wip-us.apache.org/repos/asf/hadoop/repo
Commit: http://git-wip-us.apache.org/repos/asf/hadoop/commit/382d6152
Tree: http://git-wip-us.apache.org/repos/asf/hadoop/tree/382d6152
Diff: http://git-wip-us.apache.org/repos/asf/hadoop/diff/382d6152

Branch: refs/heads/trunk
Commit: 382d6152602339fe58169b2918ec74e7a7cd5581
Parents: 4bcbef3
Author: Mingliang Liu <lium...@apache.org>
Authored: Mon Aug 15 20:22:14 2016 -0700
Committer: Mingliang Liu <lium...@apache.org>
Committed: Mon Aug 15 20:22:14 2016 -0700

----------------------------------------------------------------------
 .../src/site/markdown/Benchmarking.md           | 106 +++++++++++++++++++
 .../server/namenode/NNThroughputBenchmark.java  |  32 +-----
 hadoop-project/src/site/site.xml                |   1 +
 3 files changed, 110 insertions(+), 29 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/hadoop/blob/382d6152/hadoop-common-project/hadoop-common/src/site/markdown/Benchmarking.md
----------------------------------------------------------------------
diff --git a/hadoop-common-project/hadoop-common/src/site/markdown/Benchmarking.md b/hadoop-common-project/hadoop-common/src/site/markdown/Benchmarking.md
new file mode 100644
index 0000000..678dcee
--- /dev/null
+++ b/hadoop-common-project/hadoop-common/src/site/markdown/Benchmarking.md
@@ -0,0 +1,106 @@
+<!---
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License. See accompanying LICENSE file.
+-->
+
+# Hadoop Benchmarking
+
+<!-- MACRO{toc|fromDepth=0|toDepth=3} -->
+
+This page is to discuss benchmarking Hadoop using tools it provides.
+
+## NNThroughputBenchmark
+
+### Overview
+
+**NNThroughputBenchmark**, as its name indicates, is a name-node throughput benchmark, which runs a series of client threads on a single node against a name-node. If no name-node is configured, it first starts a name-node in the same process (_standalone mode_), in which case each client repeatedly performs the same operation by directly calling the respective name-node methods. Otherwise, the benchmark performs the operations against a remote name-node via client protocol RPCs (_remote mode_). Either way, all clients run locally in a single process rather than remotely across different nodes. The reason is to avoid communication overhead caused by RPC connections and serialization, and thus reveal the upper bound of pure name-node performance.
+
+The benchmark first generates inputs for each thread so that the input generation overhead does not affect the resulting statistics. The number of operations performed by each thread is practically the same. Precisely, the difference between the number of operations performed by any two threads does not exceed 1. Then the benchmark executes the specified number of operations using the specified number of threads and outputs the resulting stats by measuring the number of operations performed by the name-node per second.
+
+### Commands
+
+The general command line syntax is:
+
+`hadoop org.apache.hadoop.hdfs.server.namenode.NNThroughputBenchmark [genericOptions] [commandOptions]`
+
+#### Generic Options
+
+This benchmark honors the [Hadoop command-line Generic Options](CommandsManual.html#Generic_Options) to alter its behavior. The benchmark, like other tools, relies on the `fs.defaultFS` config, which can be overridden by the `-fs` command option, to decide whether to run in standalone mode or remote mode. If the `fs.defaultFS` scheme is not specified or is `file` (local), the benchmark runs in _standalone mode_. Note that in _remote mode_ the name-node config `dfs.namenode.fs-limits.min-block-size` should be set to 16, whereas in _standalone mode_ the benchmark turns off minimum block size verification for its internal name-node.
+
+#### Command Options
+
+The following are all supported command options:
+
+| COMMAND\_OPTION    | Description |
+|:---- |:---- |
+|`-op` | Specify the operation. This option must be provided and should be the first option. |
+|`-logLevel` | Specify the logging level when the benchmark runs. The default logging level is ERROR. |
+|`-UGCacheRefreshCount` | After every specified number of operations, the benchmark purges the name-node's user group cache. By default the refresh is never called. |
+|`-keepResults` | If specified, do not clean up the name-space after execution. By default the name-space will be removed after test. |
+
+##### Operations Supported
+
+The following operations are supported, along with their operation-specific parameters (all optional) and default values.
+
+| OPERATION\_OPTION    | Operation-specific parameters |
+|:---- |:---- |
+|`all` | _options for other operations_ |
+|`create` | [`-threads 3`] [`-files 10`] [`-filesPerDir 4`] [`-close`] |
+|`mkdirs` | [`-threads 3`] [`-dirs 10`] [`-dirsPerDir 2`] |
+|`open` | [`-threads 3`] [`-files 10`] [`-filesPerDir 4`] [`-useExisting`] |
+|`delete` | [`-threads 3`] [`-files 10`] [`-filesPerDir 4`] [`-useExisting`] |
+|`fileStatus` | [`-threads 3`] [`-files 10`] [`-filesPerDir 4`] [`-useExisting`] |
+|`rename` | [`-threads 3`] [`-files 10`] [`-filesPerDir 4`] [`-useExisting`] |
+|`blockReport` | [`-datanodes 10`] [`-reports 30`] [`-blocksPerReport 100`] [`-blocksPerFile 10`] |
+|`replication` | [`-datanodes 10`] [`-nodesToDecommission 1`] [`-nodeReplicationLimit 100`] [`-totalBlocks 100`] [`-replication 3`] |
+|`clean` | N/A |
+
+##### Operation Options
+
+When running the benchmark with the operations above, provide the operation-specific parameters described in the following table.
+
+| OPERATION\_SPECIFIC\_OPTION    | Description |
+|:---- |:---- |
+|`-threads` | Number of total threads to run the respective operation. |
+|`-files` | Number of total files for the respective operation. |
+|`-dirs` | Number of total directories for the respective operation. |
+|`-filesPerDir` | Number of files per directory. |
+|`-close` | Close the files after creation. |
+|`-dirsPerDir` | Number of directories per directory. |
+|`-useExisting` | If specified, do not recreate the name-space, use existing data. |
+|`-datanodes` | Total number of simulated data-nodes. |
+|`-reports` | Total number of block reports to send. |
+|`-blocksPerReport` | Number of blocks per report. |
+|`-blocksPerFile` | Number of blocks per file. |
+|`-nodesToDecommission` | Total number of simulated data-nodes to decommission. |
+|`-nodeReplicationLimit` | The maximum number of outgoing replication streams for a data-node. |
+|`-totalBlocks` | Number of total blocks to operate. |
+|`-replication` | Replication factor. Will be adjusted to number of data-nodes if it is larger than that. |
+
+### Reports
+
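+The benchmark measures the number of operations performed by the name-node per second. Specifically, for each operation tested, it reports the total running time (_Elapsed Time_), operation throughput (_Ops per sec_), and average time for the operations (_Average Time_). For throughput, higher is better. Note that the sample report in this section is consistent with _Elapsed Time_ being printed in milliseconds rather than seconds: 100000 operations at about 10515 operations per second take about 9.5 seconds, and the report shows 9510.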
+
+The following is a sample report produced by a command that opens 100K files with 1K threads against a remote name-node. See [HDFS scalability: the limits to growth](https://www.usenix.org/legacy/publications/login/2010-04/openpdfs/shvachko.pdf) for real-world benchmark stats.
+```
+$ hadoop org.apache.hadoop.hdfs.server.namenode.NNThroughputBenchmark -fs hdfs://nameservice:9000 -op open -threads 1000 -files 100000
+
+--- open inputs ---
+nrFiles = 100000
+nrThreads = 1000
+nrFilesPerDir = 4
+--- open stats  ---
+# operations: 100000
+Elapsed Time: 9510
+ Ops per sec: 10515.247108307045
+Average Time: 90
+```
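
Reviewer note on the sample report above: the numbers can be cross-checked with a small calculation. The sketch below is not taken from NNThroughputBenchmark; the class and method names are hypothetical, and it assumes that _Elapsed Time_ is printed in milliseconds, which is what the sample values suggest.

```java
// Hypothetical helper for checking the sample report; not part of Hadoop.
final class ThroughputCheck {
  /**
   * Operations per second, assuming elapsedMillis is a duration in
   * milliseconds (an assumption inferred from the sample report).
   */
  static double opsPerSecond(long numOps, long elapsedMillis) {
    // Guard against division by zero for an empty or instantaneous run.
    return elapsedMillis == 0 ? 0.0 : 1000.0 * numOps / elapsedMillis;
  }

  public static void main(String[] args) {
    // Values copied from the sample: 100000 operations, Elapsed Time 9510.
    System.out.println(opsPerSecond(100000, 9510));
  }
}
```

Under that assumption the result agrees with the reported _Ops per sec_ of roughly 10515.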

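Reviewer note on the Generic Options text in Benchmarking.md: the rule that a missing or `file` scheme in `fs.defaultFS` means standalone mode can be sketched as below. This is only an illustration using `java.net.URI`; the class and method names are hypothetical and the benchmark's real check may be written differently.

```java
import java.net.URI;

// Hypothetical illustration of the documented rule; not the benchmark's code.
final class ModeSketch {
  /**
   * Returns true when the file system URI has no scheme or the "file"
   * scheme, which the documentation maps to standalone mode.
   */
  static boolean isStandalone(String defaultFs) {
    String scheme = URI.create(defaultFs).getScheme();
    return scheme == null || scheme.equalsIgnoreCase("file");
  }

  public static void main(String[] args) {
    System.out.println(isStandalone("file:///"));                // local file system
    System.out.println(isStandalone("hdfs://nameservice:9000")); // remote name-node
  }
}
```

With the `-fs hdfs://nameservice:9000` option from the sample command, this rule selects remote mode.
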
http://git-wip-us.apache.org/repos/asf/hadoop/blob/382d6152/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/NNThroughputBenchmark.java
----------------------------------------------------------------------
diff --git a/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/NNThroughputBenchmark.java b/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/NNThroughputBenchmark.java
index efd731e..be2a678 100644
--- a/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/NNThroughputBenchmark.java
+++ b/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/NNThroughputBenchmark.java
@@ -95,36 +95,10 @@ import org.apache.log4j.LogManager;
  * except for the name-node. Each operation is executed
  * by calling directly the respective name-node method.
  * The name-node here is real all other components are simulated.
- * 
- * This benchmark supports
- * <a href="{@docRoot}/../hadoop-project-dist/hadoop-common/CommandsManual.html#Generic_Options">
- * standard command-line options</a>. If you use remote namenode by <tt>-fs</tt>
- * option, its config <tt>dfs.namenode.fs-limits.min-block-size</tt> should be
- * set as 16.
  *
- * Command line arguments for the benchmark include:
- * <ol>
- * <li>total number of operations to be performed,</li>
- * <li>number of threads to run these operations,</li>
- * <li>followed by operation specific input parameters.</li>
- * <li>-logLevel L specifies the logging level when the benchmark runs.
- * The default logging level is {@link Level#ERROR}.</li>
- * <li>-UGCacheRefreshCount G will cause the benchmark to call
- * {@link NameNodeRpcServer#refreshUserToGroupsMappings} after
- * every G operations, which purges the name-node's user group cache.
- * By default the refresh is never called.</li>
- * <li>-keepResults do not clean up the name-space after execution.</li>
- * <li>-useExisting do not recreate the name-space, use existing data.</li>
- * </ol>
- * 
- * The benchmark first generates inputs for each thread so that the
- * input generation overhead does not effect the resulting statistics.
- * The number of operations performed by threads is practically the same. 
- * Precisely, the difference between the number of operations 
- * performed by any two threads does not exceed 1.
- * 
- * Then the benchmark executes the specified number of operations using 
- * the specified number of threads and outputs the resulting stats.
+ * For usage, please see <a href="http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/Benchmarking.html#NNThroughputBenchmark">the documentation</a>.
+ * Meanwhile, if you change the usage of this program, please also update the
+ * documentation accordingly.
  */
 public class NNThroughputBenchmark implements Tool {
   private static final Log LOG = LogFactory.getLog(NNThroughputBenchmark.class);
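
Reviewer note on the javadoc text moved out of this file: the guarantee that any two threads differ by at most one operation follows from a simple even split. The sketch below shows one way to produce such a split; the class and method names are hypothetical, and the benchmark's own input generation may distribute operations differently.

```java
// Hypothetical illustration of an even split; not the benchmark's code.
final class OpSplitSketch {
  /**
   * Divides numOps among numThreads so that per-thread counts differ
   * by at most 1. The first (numOps % numThreads) threads get one extra.
   */
  static int[] split(int numOps, int numThreads) {
    int[] perThread = new int[numThreads];
    int base = numOps / numThreads;
    int extra = numOps % numThreads;
    for (int i = 0; i < numThreads; i++) {
      perThread[i] = base + (i < extra ? 1 : 0);
    }
    return perThread;
  }

  public static void main(String[] args) {
    // The documented defaults for create: 10 files over 3 threads.
    System.out.println(java.util.Arrays.toString(split(10, 3)));
  }
}
```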

http://git-wip-us.apache.org/repos/asf/hadoop/blob/382d6152/hadoop-project/src/site/site.xml
----------------------------------------------------------------------
diff --git a/hadoop-project/src/site/site.xml b/hadoop-project/src/site/site.xml
index de09016..9fa1469 100644
--- a/hadoop-project/src/site/site.xml
+++ b/hadoop-project/src/site/site.xml
@@ -169,6 +169,7 @@
       <item name="GridMix" href="hadoop-gridmix/GridMix.html"/>
       <item name="Rumen" href="hadoop-rumen/Rumen.html"/>
       <item name="Scheduler Load Simulator" href="hadoop-sls/SchedulerLoadSimulator.html"/>
+      <item name="Hadoop Benchmarking" href="hadoop-project-dist/hadoop-common/Benchmarking.html"/>
     </menu>
 
     <menu name="Reference" inherit="top">

