[
https://issues.apache.org/jira/browse/HADOOP-16775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17001830#comment-17001830
]
Amir Shenavandeh edited comment on HADOOP-16775 at 12/22/19 5:38 AM:
---------------------------------------------------------------------
The patch is for Hadoop 2.10.0 and adds a timestamp to the temp file name. We
can then track each temp file by its creation time within each attempt log.
The change is in:
[https://github.com/apache/hadoop/blob/release-2.10.0-RC1/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/RetriableFileCopyCommand.java#L237]
was (Author: shenavandeh):
The patch is for Hadoop 2.10.0 and adds a timestamp to the temp file name. We
can then track each temp file by its creation time within each attempt log.
The change is in:
[https://github.com/apache/hadoop/blob/release-2.10.0-RC1/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/RetriableFileCopyCommand.java#L237]
private Path getTmpFile(Path target, Mapper.Context context) {
  Path targetWorkPath = new Path(context.getConfiguration().
      get(DistCpConstants.CONF_LABEL_TARGET_WORK_PATH));

  Path root = target.equals(targetWorkPath) ? targetWorkPath.getParent()
      : targetWorkPath;
  LOG.info("Creating temp file: " +
      new Path(root, ".distcp.tmp." + context.getTaskAttemptID().toString()));
  return new Path(root, ".distcp.tmp." + context.getTaskAttemptID().toString());
}
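
For illustration only, here is a rough sketch of how a timestamp could be
appended to the name built by the getTmpFile method quoted above. It is not
the content of patch.txt. The class TmpFileNameSketch, the helper tmpFileName,
and the sample attempt ID are hypothetical, and the sketch uses plain strings
instead of Hadoop's Path so that it runs on its own.

```java
public class TmpFileNameSketch {
    // Prefix taken from the getTmpFile method quoted above.
    public static final String PREFIX = ".distcp.tmp.";

    // Hypothetical helper: combines the task attempt ID with a caller-supplied
    // timestamp so that successive files in one attempt get different names.
    public static String tmpFileName(String taskAttemptId, long timestampMillis) {
        return PREFIX + taskAttemptId + "." + timestampMillis;
    }

    public static void main(String[] args) {
        // Made-up attempt ID, used only to show the shape of the result.
        String attempt = "attempt_0000000000000_0001_m_000000_0";
        System.out.println(tmpFileName(attempt, System.currentTimeMillis()));
    }
}
```

A timestamp alone does not guarantee uniqueness if two names are generated in
the same millisecond, so a random or counter-based suffix may be worth
considering as well.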
> Hadoop DistCp reuses the same temp file within the task for different files.
> ----------------------------------------------------------------------------
>
> Key: HADOOP-16775
> URL: https://issues.apache.org/jira/browse/HADOOP-16775
> Project: Hadoop Common
> Issue Type: Improvement
> Components: tools/distcp
> Affects Versions: 2.0
> Reporter: Amir Shenavandeh
> Priority: Major
> Attachments: patch.txt
>
>
> Hadoop DistCp reuses the same temp file name for all the files copied within
> each task attempt and then moves each one to its target name, which is also a
> server-side copy. For copies over S3 this causes inconsistency, because S3
> only provides read-after-write consistency for brand-new objects. There is
> also inconsistency in the contents of overwritten objects on S3.
> To avoid this, we should randomize the temp file name.
>