[
https://issues.apache.org/jira/browse/HADOOP-16775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Amir Shenavandeh updated HADOOP-16775:
--------------------------------------
Comment: was deleted
(was: The patch is for Hadoop 2.10.0. It adds a timestamp to the temp file
name, so we can track each temp file by its creation time in the log of each
task attempt.
The change is in:
[https://github.com/apache/hadoop/blob/release-2.10.0-RC1/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/RetriableFileCopyCommand.java#L237]
)
> Hadoop DistCp reuses the same temp file within the task for different files.
> ----------------------------------------------------------------------------
>
> Key: HADOOP-16775
> URL: https://issues.apache.org/jira/browse/HADOOP-16775
> Project: Hadoop Common
> Issue Type: Improvement
> Components: tools/distcp
> Affects Versions: 2.0
> Reporter: Amir Shenavandeh
> Priority: Major
> Labels: DistCp, S3, hadoop-tools
> Attachments: HADOOP-16775.patch
>
>
> Hadoop DistCp reuses the same temp file name for every file copied within
> a task attempt and then moves each temp file to its target name, which is
> also a server-side copy. For copies to S3, this causes inconsistency,
> because S3 is read-after-write consistent only for brand-new objects. The
> contents of overwritten objects on S3 are also not immediately consistent.
> To avoid this, we should randomize the temp file name so that each temp
> file gets a different name.
>
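The naming change proposed above can be sketched roughly as follows. This is a minimal illustration, not the attached patch: the class name, the method names, the `.distcp.tmp.` prefix, and the sample attempt ID are all assumptions made for the example, and a random UUID stands in for whatever suffix the patch actually uses.

```java
import java.util.UUID;

// Illustrative sketch only; none of these names are taken from the patch.
class TempNameSketch {
    // Hypothetical prefix for temp files, assumed for this example.
    static final String PREFIX = ".distcp.tmp.";

    // Reported behaviour: the name depends only on the task attempt,
    // so every file copied by that attempt goes through the same temp name.
    static String sharedTempName(String attemptId) {
        return PREFIX + attemptId;
    }

    // Proposed behaviour: append a random suffix so that each copied file
    // gets its own temp name, even within a single task attempt.
    static String uniqueTempName(String attemptId) {
        return PREFIX + attemptId + "." + UUID.randomUUID();
    }

    public static void main(String[] args) {
        String attemptId = "attempt_0001_m_000000_0"; // made-up attempt ID
        System.out.println(sharedTempName(attemptId));
        System.out.println(sharedTempName(attemptId)); // identical to the line above
        System.out.println(uniqueTempName(attemptId));
        System.out.println(uniqueTempName(attemptId)); // differs from the line above
    }
}
```

A timestamp suffix would keep the names easy to correlate with the attempt log, but two files started within the same clock tick could still share a name; a random suffix avoids that at the cost of less readable names.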
--
This message was sent by Atlassian Jira
(v8.3.4#803005)