[ 
https://issues.apache.org/jira/browse/HADOOP-16775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17001830#comment-17001830
 ] 

Amir Shenavandeh edited comment on HADOOP-16775 at 12/22/19 5:38 AM:
---------------------------------------------------------------------

The patch is for Hadoop 2.10.0 and adds a timestamp to the temp file name. This 
lets us track each temp file within each attempt log by the time it was created.

in: 
[https://github.com/apache/hadoop/blob/release-2.10.0-RC1/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/RetriableFileCopyCommand.java#L237]
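
As a rough sketch of the idea (not the attached patch itself), the temp file
name could be built along these lines. The class name, helper name, and sample
attempt ID below are hypothetical, and a plain String stands in for the Hadoop
Path so that the example runs on its own:

```java
public class TmpNameSketch {
    // Hypothetical helper: appends a millisecond timestamp to the existing
    // ".distcp.tmp.<taskAttemptId>" name so that successive files copied by
    // the same task attempt normally get different temp names.
    static String tmpFileName(String taskAttemptId, long timestampMillis) {
        return ".distcp.tmp." + taskAttemptId + "." + timestampMillis;
    }

    public static void main(String[] args) {
        // Made-up attempt ID, for illustration only.
        String attempt = "attempt_1576993000000_0001_m_000000_0";
        System.out.println(tmpFileName(attempt, System.currentTimeMillis()));
    }
}
```

Two files created within the same millisecond would still share a name, so a
random suffix may be a safer choice if that matters.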


was (Author: shenavandeh):
The patch is for Hadoop 2.10.0 and adds a timestamp to the temp file name. This 
lets us track each temp file within each attempt log by the time it was created.

in: 
[https://github.com/apache/hadoop/blob/release-2.10.0-RC1/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/RetriableFileCopyCommand.java#L237]

 
    private Path getTmpFile(Path target, Mapper.Context context) {
      Path targetWorkPath = new Path(context.getConfiguration().
          get(DistCpConstants.CONF_LABEL_TARGET_WORK_PATH));

      Path root = target.equals(targetWorkPath) ? targetWorkPath.getParent()
          : targetWorkPath;
      LOG.info("Creating temp file: " +
          new Path(root, ".distcp.tmp." + context.getTaskAttemptID().toString()));
      return new Path(root, ".distcp.tmp." + context.getTaskAttemptID().toString());
    }

 

 

> Hadoop DistCp reuses the same temp file within the task for different files.
> ----------------------------------------------------------------------------
>
>                 Key: HADOOP-16775
>                 URL: https://issues.apache.org/jira/browse/HADOOP-16775
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: tools/distcp
>    Affects Versions: 2.0
>            Reporter: Amir Shenavandeh
>            Priority: Major
>         Attachments: patch.txt
>
>
> Hadoop DistCp reuses the same temp file name for all the files copied within 
> each task attempt and then moves them to the target name, which is also a 
> server-side copy. For copies over S3 this causes inconsistency, as S3 is only 
> read-after-write consistent for brand-new objects. There is also 
> inconsistency in the contents of overwritten objects on S3.
> To avoid this, we should randomize the temp file name.
>  



