[
https://issues.apache.org/jira/browse/HADOOP-18629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17693219#comment-17693219
]
ASF GitHub Bot commented on HADOOP-18629:
-----------------------------------------
steveloughran commented on code in PR #5391:
URL: https://github.com/apache/hadoop/pull/5391#discussion_r1117059919
##########
hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/TestDistCpSystem.java:
##########
@@ -594,4 +595,49 @@ public void testUpdateRoot() throws Exception {
assertEquals(srcStatus.getModificationTime(),
destStatus2.getModificationTime());
}
+
+ @Test
+ public void testFavoredNodes() throws Exception {
+ final String testRoot = "/testdir";
+ final String testSrc = testRoot + "/" + SRCDAT;
+ final String testDst = testRoot + "/" + DSTDAT;
+
+ String nnUri = FileSystem.getDefaultUri(conf).toString();
+ String rootStr = nnUri + testSrc;
+ String tgtStr = nnUri + testDst;
+
+ FileEntry[] srcFiles = {
+ new FileEntry(SRCDAT, true),
+ new FileEntry(SRCDAT + "/file10", false)
+ };
+
+ DistributedFileSystem fs = (DistributedFileSystem)
FileSystem.get(URI.create(nnUri), conf);
+ createFiles(fs, testRoot, srcFiles, -1);
+
+ // sad path
+ assertNotEquals(ToolRunner.run(conf, new DistCp(),
Review Comment:
the args are the wrong way round fro these assertions.
switch to AssertJ assertions and add good .descriptions, so if a jenkins
runs fails we know what went wrong, rather than just what line. Right
##########
hadoop-tools/hadoop-distcp/src/site/markdown/DistCp.md.vm:
##########
@@ -364,6 +364,7 @@ Command Line Options
| `-direct` | Write directly to destination paths | Useful for avoiding
potentially very expensive temporary file rename operations when the
destination is an object store |
| `-useiterator` | Uses single threaded listStatusIterator to build listing |
Useful for saving memory at the client side. Using this option will ignore the
numListstatusThreads option |
| `-updateRoot` | Update root directory attributes (eg permissions, ownership
...) | Useful if you need to enforce root directory attributes update when
using distcp |
+| `-favoredNodes` | Specify favored nodes (Desired option input format:
host1:port1,host2:port2,...) | Useful if you need to specify favored nodes when
using distcp |
Review Comment:
+. Requires the destination to be an hdfs filesystem
##########
hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/TestDistCpSystem.java:
##########
@@ -32,6 +32,7 @@
import java.util.List;
import java.util.Random;
+import org.apache.hadoop.hdfs.server.datanode.DataNode;
Review Comment:
nit: must go in the right place in the org.apache.hadoop imports.
##########
hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/RetriableFileCopyCommand.java:
##########
@@ -247,6 +276,22 @@ private long copyToFile(Path targetPath, FileSystem
targetFS,
context);
}
+ private InetSocketAddress[] toFavoredNodes(String favoredNodesStr) throws
UnknownHostException {
+ List<InetSocketAddress> result = new ArrayList<>();
+ for (String hostAndPort : favoredNodesStr.split(",")) {
+ String[] split = hostAndPort.split(":");
Review Comment:
log at debug, or maybe even at info.
##########
hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/OptionsParser.java:
##########
@@ -239,6 +239,21 @@ public static DistCpOptions parse(String[] args)
}
}
+ if (command.hasOption(DistCpOptionSwitch.FAVORED_NODES.getSwitch())) {
+ String favoredNodesStr = getVal(command,
DistCpOptionSwitch.FAVORED_NODES.getSwitch().trim());
+ if (StringUtils.isEmpty(favoredNodesStr)) {
Review Comment:
if you pull this out to a @VisibleForTesting package scoped method then unit
tests could to try to break it through invalid args, e.g
* trailing ,
* empty string
invalid node/valid node bad port.
##########
hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/RetriableFileCopyCommand.java:
##########
@@ -247,6 +276,22 @@ private long copyToFile(Path targetPath, FileSystem
targetFS,
context);
}
+ private InetSocketAddress[] toFavoredNodes(String favoredNodesStr) throws
UnknownHostException {
+ List<InetSocketAddress> result = new ArrayList<>();
+ for (String hostAndPort : favoredNodesStr.split(",")) {
+ String[] split = hostAndPort.split(":");
+ if (split.length != 2) {
+ throw new IllegalArgumentException("Illegal favoredNodes parameter: "
+ hostAndPort);
Review Comment:
prefer `org.apache.hadoop.util.Preconditions` here, e.g
```
checkArgument(split.length == 2, ""Illegal favoredNodes parameter: %s",
hostAndPort)
```
##########
hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCpOptions.java:
##########
@@ -164,6 +164,8 @@ public final class DistCpOptions {
private final boolean updateRoot;
+ private final String favoredNodes;
+
Review Comment:
nit: add javadocs.
##########
hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/RetriableFileCopyCommand.java:
##########
@@ -247,6 +276,22 @@ private long copyToFile(Path targetPath, FileSystem
targetFS,
context);
}
+ private InetSocketAddress[] toFavoredNodes(String favoredNodesStr) throws
UnknownHostException {
Review Comment:
javadocs to explain what happens, when exceptions are raised.
##########
hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/RetriableFileCopyCommand.java:
##########
@@ -247,6 +276,22 @@ private long copyToFile(Path targetPath, FileSystem
targetFS,
context);
}
+ private InetSocketAddress[] toFavoredNodes(String favoredNodesStr) throws
UnknownHostException {
+ List<InetSocketAddress> result = new ArrayList<>();
+ for (String hostAndPort : favoredNodesStr.split(",")) {
Review Comment:
what happens if an empty string is passed in? it should be an error, right?
so add a test
##########
hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/TestDistCpOptions.java:
##########
@@ -574,4 +574,15 @@ public void testUpdateRoot() {
.build();
Assert.assertTrue(options.shouldUpdateRoot());
}
+
+ @Test
+ public void testFavoredNodes() {
+ final DistCpOptions options = new DistCpOptions.Builder(
+ Collections.singletonList(
+ new Path("hdfs://localhost:8020/source")),
+ new Path("hdfs://localhost:8020/target/"))
+ .withFavoredNodes("localhost:50010")
+ .build();
+ Assert.assertNotNull(options.getFavoredNodes());
Review Comment:
prefer assertJ for new tests, with a message, and ideally verification the
node value went all the way through
##########
hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/RetriableFileCopyCommand.java:
##########
@@ -223,9 +233,25 @@ private long copyToFile(Path targetPath, FileSystem
targetFS,
FSDataOutputStream out;
ChecksumOpt checksumOpt = getChecksumOpt(fileAttributes, sourceChecksum);
if (!preserveEC || ecPolicy == null) {
- out = targetFS.create(targetPath, permission,
- EnumSet.of(CreateFlag.CREATE, CreateFlag.OVERWRITE),
copyBufferSize,
- repl, blockSize, context, checksumOpt);
+ if (targetFS instanceof DistributedFileSystem
Review Comment:
There should be one path if preserving ec or favoredNodes is set, so switch
to hdfsbuilder, leaving the other path as create()
> Hadoop DistCp supports specifying favoredNodes for data copying
> ---------------------------------------------------------------
>
> Key: HADOOP-18629
> URL: https://issues.apache.org/jira/browse/HADOOP-18629
> Project: Hadoop Common
> Issue Type: New Feature
> Components: tools/distcp
> Affects Versions: 3.3.4
> Reporter: zhuyaogai
> Priority: Major
> Labels: pull-request-available
>
> When importing large scale data to HBase, we always generate the hfiles with
> other Hadoop cluster, use the Distcp tool to copy the data to the HBase
> cluster, and bulkload data to HBase table. However, the data locality is
> rather low which may result in high query latency. After taking a compaction
> it will recover. Therefore, we can increase the data locality by specifying
> the favoredNodes in Distcp.
> Could I submit a pull request to optimize it?
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]