[ https://issues.apache.org/jira/browse/HADOOP-18629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17693219#comment-17693219 ]

ASF GitHub Bot commented on HADOOP-18629:
-----------------------------------------

steveloughran commented on code in PR #5391:
URL: https://github.com/apache/hadoop/pull/5391#discussion_r1117059919


##########
hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/TestDistCpSystem.java:
##########
@@ -594,4 +595,49 @@ public void testUpdateRoot() throws Exception {
     assertEquals(srcStatus.getModificationTime(),
         destStatus2.getModificationTime());
   }
+
+  @Test
+  public void testFavoredNodes() throws Exception {
+    final String testRoot = "/testdir";
+    final String testSrc = testRoot + "/" + SRCDAT;
+    final String testDst = testRoot + "/" + DSTDAT;
+
+    String nnUri = FileSystem.getDefaultUri(conf).toString();
+    String rootStr = nnUri + testSrc;
+    String tgtStr = nnUri + testDst;
+
+    FileEntry[] srcFiles = {
+        new FileEntry(SRCDAT, true),
+        new FileEntry(SRCDAT + "/file10", false)
+    };
+
+    DistributedFileSystem fs = (DistributedFileSystem) FileSystem.get(URI.create(nnUri), conf);
+    createFiles(fs, testRoot, srcFiles, -1);
+
+    // sad path
+    assertNotEquals(ToolRunner.run(conf, new DistCp(),

Review Comment:
   the args are the wrong way round for these assertions.
   
   switch to AssertJ assertions and add good descriptions, so if a jenkins run fails we know what went wrong, rather than just what line.



##########
hadoop-tools/hadoop-distcp/src/site/markdown/DistCp.md.vm:
##########
@@ -364,6 +364,7 @@ Command Line Options
 | `-direct` | Write directly to destination paths | Useful for avoiding potentially very expensive temporary file rename operations when the destination is an object store |
 | `-useiterator` | Uses single threaded listStatusIterator to build listing | Useful for saving memory at the client side. Using this option will ignore the numListstatusThreads option |
 | `-updateRoot` | Update root directory attributes (eg permissions, ownership ...) | Useful if you need to enforce root directory attributes update when using distcp |
+| `-favoredNodes` | Specify favored nodes (Desired option input format: host1:port1,host2:port2,...) | Useful if you need to specify favored nodes when using distcp |

Review Comment:
   +. Requires the destination to be an hdfs filesystem



##########
hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/TestDistCpSystem.java:
##########
@@ -32,6 +32,7 @@
 import java.util.List;
 import java.util.Random;
 
+import org.apache.hadoop.hdfs.server.datanode.DataNode;

Review Comment:
   nit: must go in the right place in the org.apache.hadoop imports.



##########
hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/RetriableFileCopyCommand.java:
##########
@@ -247,6 +276,22 @@ private long copyToFile(Path targetPath, FileSystem targetFS,
         context);
   }
 
+  private InetSocketAddress[] toFavoredNodes(String favoredNodesStr) throws UnknownHostException {
+    List<InetSocketAddress> result = new ArrayList<>();
+    for (String hostAndPort : favoredNodesStr.split(",")) {
+      String[] split = hostAndPort.split(":");

Review Comment:
   log at debug, or maybe even at info.



##########
hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/OptionsParser.java:
##########
@@ -239,6 +239,21 @@ public static DistCpOptions parse(String[] args)
       }
     }
 
+    if (command.hasOption(DistCpOptionSwitch.FAVORED_NODES.getSwitch())) {
+      String favoredNodesStr = getVal(command, DistCpOptionSwitch.FAVORED_NODES.getSwitch().trim());
+      if (StringUtils.isEmpty(favoredNodesStr)) {

Review Comment:
   if you pull this out to a @VisibleForTesting package scoped method then unit tests could try to break it through invalid args, e.g.
   * trailing ,
   * empty string
   * invalid node / valid node with bad port
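   To make the suggestion concrete, here is a rough sketch of what such a package-scoped, @VisibleForTesting-style parser could look like. The class name, error messages, and the exact validation rules are illustrative only, not the PR's actual code; it uses plain JDK types so the edge cases (trailing comma, empty string, bad port) are easy to exercise from a unit test:
   
   ```java
   import java.net.InetSocketAddress;
   import java.util.ArrayList;
   import java.util.List;
   
   public class FavoredNodesParser {
   
     /**
      * Parse a comma-separated "host:port,host:port" list into socket addresses.
      * Throws IllegalArgumentException on an empty string, a trailing comma,
      * a missing ":" separator, or a non-numeric/out-of-range port.
      */
     // package scope so unit tests can feed it invalid args directly
     static InetSocketAddress[] toFavoredNodes(String favoredNodesStr) {
       if (favoredNodesStr == null || favoredNodesStr.trim().isEmpty()) {
         throw new IllegalArgumentException("Empty favoredNodes parameter");
       }
       List<InetSocketAddress> result = new ArrayList<>();
       // split(",", -1) keeps trailing empty entries, so "host:1," is rejected
       for (String hostAndPort : favoredNodesStr.split(",", -1)) {
         String[] split = hostAndPort.split(":");
         if (split.length != 2) {
           throw new IllegalArgumentException(
               "Illegal favoredNodes parameter: " + hostAndPort);
         }
         final int port;
         try {
           port = Integer.parseInt(split[1]);
         } catch (NumberFormatException e) {
           throw new IllegalArgumentException(
               "Illegal favoredNodes port: " + hostAndPort, e);
         }
         if (port < 0 || port > 65535) {
           throw new IllegalArgumentException(
               "Illegal favoredNodes port: " + hostAndPort);
         }
         // createUnresolved: no DNS lookup at parse time
         result.add(InetSocketAddress.createUnresolved(split[0], port));
       }
       return result.toArray(new InetSocketAddress[0]);
     }
   }
   ```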



##########
hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/RetriableFileCopyCommand.java:
##########
@@ -247,6 +276,22 @@ private long copyToFile(Path targetPath, FileSystem targetFS,
         context);
   }
 
+  private InetSocketAddress[] toFavoredNodes(String favoredNodesStr) throws UnknownHostException {
+    List<InetSocketAddress> result = new ArrayList<>();
+    for (String hostAndPort : favoredNodesStr.split(",")) {
+      String[] split = hostAndPort.split(":");
+      if (split.length != 2) {
+        throw new IllegalArgumentException("Illegal favoredNodes parameter: " + hostAndPort);

Review Comment:
   prefer `org.apache.hadoop.util.Preconditions` here, e.g.
   ```
   checkArgument(split.length == 2, "Illegal favoredNodes parameter: %s", hostAndPort);
   ```
   



##########
hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCpOptions.java:
##########
@@ -164,6 +164,8 @@ public final class DistCpOptions {
 
   private final boolean updateRoot;
 
+  private final String favoredNodes;
+

Review Comment:
   nit: add javadocs.



##########
hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/RetriableFileCopyCommand.java:
##########
@@ -247,6 +276,22 @@ private long copyToFile(Path targetPath, FileSystem targetFS,
         context);
   }
 
+  private InetSocketAddress[] toFavoredNodes(String favoredNodesStr) throws UnknownHostException {

Review Comment:
   javadocs to explain what happens and when exceptions are raised.



##########
hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/RetriableFileCopyCommand.java:
##########
@@ -247,6 +276,22 @@ private long copyToFile(Path targetPath, FileSystem targetFS,
         context);
   }
 
+  private InetSocketAddress[] toFavoredNodes(String favoredNodesStr) throws UnknownHostException {
+    List<InetSocketAddress> result = new ArrayList<>();
+    for (String hostAndPort : favoredNodesStr.split(",")) {

Review Comment:
   what happens if an empty string is passed in? it should be an error, right? so add a test



##########
hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/TestDistCpOptions.java:
##########
@@ -574,4 +574,15 @@ public void testUpdateRoot() {
         .build();
     Assert.assertTrue(options.shouldUpdateRoot());
   }
+
+  @Test
+  public void testFavoredNodes() {
+    final DistCpOptions options = new DistCpOptions.Builder(
+        Collections.singletonList(
+            new Path("hdfs://localhost:8020/source")),
+        new Path("hdfs://localhost:8020/target/"))
+        .withFavoredNodes("localhost:50010")
+        .build();
+    Assert.assertNotNull(options.getFavoredNodes());

Review Comment:
   prefer assertJ for new tests, with a message, and ideally verification that the node value went all the way through



##########
hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/RetriableFileCopyCommand.java:
##########
@@ -223,9 +233,25 @@ private long copyToFile(Path targetPath, FileSystem targetFS,
       FSDataOutputStream out;
       ChecksumOpt checksumOpt = getChecksumOpt(fileAttributes, sourceChecksum);
       if (!preserveEC || ecPolicy == null) {
-        out = targetFS.create(targetPath, permission,
-            EnumSet.of(CreateFlag.CREATE, CreateFlag.OVERWRITE), copyBufferSize,
-            repl, blockSize, context, checksumOpt);
+        if (targetFS instanceof DistributedFileSystem

Review Comment:
   There should be one path if preserving ec or favoredNodes is set, so switch to hdfsbuilder, leaving the other path as create()
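   A rough sketch of the branching being asked for, with the Hadoop types stubbed out as booleans so the selection logic itself is testable. In the real code the BUILDER path would presumably go through `DistributedFileSystem#createFile()` and the HDFS output-stream builder's setters (favoredNodes / EC policy); this stub only models which path gets taken, not the Hadoop API calls:
   
   ```java
   /**
    * Sketch: a single builder-based create path whenever EC preservation or
    * favoredNodes is requested on an HDFS target, and the plain
    * FileSystem#create() path otherwise.
    */
   public class CopyPathChooser {
   
     enum CreatePath { BUILDER, PLAIN_CREATE }
   
     static CreatePath choose(boolean preserveEc, boolean ecPolicyPresent,
         boolean favoredNodesSet, boolean targetIsHdfs) {
       // favoredNodes (like EC policies) is HDFS-specific, so the builder
       // path is only reachable when the target is an HDFS filesystem
       if (targetIsHdfs && ((preserveEc && ecPolicyPresent) || favoredNodesSet)) {
         return CreatePath.BUILDER;      // DistributedFileSystem#createFile(...)
       }
       return CreatePath.PLAIN_CREATE;   // FileSystem#create(...)
     }
   }
   ```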
   
   





> Hadoop DistCp supports specifying favoredNodes for data copying
> ---------------------------------------------------------------
>
>                 Key: HADOOP-18629
>                 URL: https://issues.apache.org/jira/browse/HADOOP-18629
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: tools/distcp
>    Affects Versions: 3.3.4
>            Reporter: zhuyaogai
>            Priority: Major
>              Labels: pull-request-available
>
> When importing large scale data to HBase, we always generate the hfiles on 
> another Hadoop cluster, use the DistCp tool to copy the data to the HBase 
> cluster, and bulk-load the data into the HBase table. However, the data 
> locality is rather low, which may result in high query latency; it only 
> recovers after a compaction. Therefore, we can increase the data locality 
> by specifying the favoredNodes in DistCp.
> Could I submit a pull request to optimize it?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
