[
https://issues.apache.org/jira/browse/HADOOP-16872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
liuxiaolong updated HADOOP-16872:
---------------------------------
Description:
We use distcp with the -direct option to copy a single file between two large
directories, and the copy takes a few minutes. If we launch too many distcp jobs
at the same time, NameNode performance degrades seriously.
hadoop distcp -direct -skipcrccheck -update -prbugaxt -i -numListstatusThreads 1
hdfs://cluster1:8020/source/100.log hdfs://cluster2:8020/target/100.jpg
|| ||Dir path||Count||
||Source dir|| hdfs://cluster1:8020/source/ ||100k+ files||
||Target dir||hdfs://cluster2:8020/target/ ||100k+ files||
In CopyCommitter.java, the function deleteAttemptTempFiles() contains this call:
targetFS.globStatus(new Path(targetWorkPath, ".distcp.tmp." +
jobId.replaceAll("job","attempt") + "*"));
This call wastes a lot of time when distcp runs between two large directories.
With the -direct option, distcp writes directly to the target file without
generating a '.distcp.tmp' temp file. So I think deleteAttemptTempFiles() should
check for the -direct option and, if it is set, return immediately without doing
anything.
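
The proposed guard could look roughly like the sketch below. It is not the
actual CopyCommitter code: the class name, the directWrite parameter and the
globber callback (a stand-in for targetFS.globStatus) are hypothetical, and how
the -direct setting reaches the committer is not shown in this issue. Only the
glob pattern is taken from the snippet above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

// Hypothetical, self-contained model of the proposed early return.
class DirectWriteCleanupSketch {

    // Builds the temp-file glob pattern quoted in the issue for a job ID.
    static String tempFileGlob(String jobId) {
        return ".distcp.tmp." + jobId.replaceAll("job", "attempt") + "*";
    }

    // globber stands in for targetFS.globStatus(...): it takes the pattern and
    // returns the names of matching temp files. The method returns the names
    // that the cleanup step would go on to delete.
    static List<String> deleteAttemptTempFiles(
            boolean directWrite,
            String jobId,
            Function<String, List<String>> globber) {
        if (directWrite) {
            // With -direct, data is written straight to the target file, so
            // there are no '.distcp.tmp' files and the glob can be skipped.
            return new ArrayList<>();
        }
        return new ArrayList<>(globber.apply(tempFileGlob(jobId)));
    }

    public static void main(String[] args) {
        int[] globCalls = {0};
        // Fake glob that counts how often the expensive call would be made.
        Function<String, List<String>> globber = pattern -> {
            globCalls[0]++;
            return Collections.singletonList(".distcp.tmp.attempt_1_0001_m_000000_0");
        };

        deleteAttemptTempFiles(true, "job_1_0001", globber);
        System.out.println("glob calls with -direct: " + globCalls[0]);

        deleteAttemptTempFiles(false, "job_1_0001", globber);
        System.out.println("glob calls without -direct: " + globCalls[0]);
    }
}
```

In this model the glob runs only on the non-direct path, which is the behaviour
the issue asks for; the non-direct cleanup is left unchanged.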
> Performance improvement when distcp files in large dir with -direct option
> --------------------------------------------------------------------------
>
> Key: HADOOP-16872
> URL: https://issues.apache.org/jira/browse/HADOOP-16872
> Project: Hadoop Common
> Issue Type: Improvement
> Reporter: liuxiaolong
> Priority: Major
>
> We use distcp with the -direct option to copy a single file between two large
> directories, and the copy takes a few minutes. If we launch too many distcp
> jobs at the same time, NameNode performance degrades seriously.
> hadoop distcp -direct -skipcrccheck -update -prbugaxt -i -numListstatusThreads 1
> hdfs://cluster1:8020/source/100.log hdfs://cluster2:8020/target/100.jpg
> || ||Dir path||Count||
> ||Source dir|| hdfs://cluster1:8020/source/ ||100k+ files||
> ||Target dir||hdfs://cluster2:8020/target/ ||100k+ files||
>
> In CopyCommitter.java, the function deleteAttemptTempFiles() contains this
> call:
> targetFS.globStatus(new Path(targetWorkPath, ".distcp.tmp." +
> jobId.replaceAll("job","attempt") + "*"));
> This call wastes a lot of time when distcp runs between two large directories.
> With the -direct option, distcp writes directly to the target file without
> generating a '.distcp.tmp' temp file. So I think deleteAttemptTempFiles()
> should check for the -direct option and, if it is set, return immediately
> without doing anything.
>