[ 
https://issues.apache.org/jira/browse/HADOOP-16428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16883808#comment-16883808
 ] 

Steve Loughran commented on HADOOP-16428:
-----------------------------------------

The S3A committers aim to eliminate the two renames which take place on task 
commit, and use various devious techniques to pass commit information from the 
workers to the driver. They support the MapReduce 2.0 APIs only.

Distcp may use MapReduce, but it's got a very different task profile: first the 
files to copy are listed, then the workers upload each in turn, then rename 
into place. The rename is there so that an incomplete upload isn't visible.
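
As a rough illustration of that upload-then-rename pattern, here is a minimal 
sketch against a local filesystem. It is not Distcp's code: the function name 
copy_then_rename and the temp-file prefix are made up for the example. On a 
local filesystem the final rename is cheap and atomic; on an object store such 
as S3 a rename is itself a copy, which is the cost being discussed here.

```python
import os
import tempfile


def copy_then_rename(src_path, dest_path, chunk_size=1024 * 1024):
    """Copy src to a temporary file beside dest, then rename it into place.

    Hypothetical helper for illustration only; not part of Distcp.
    """
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    # Keep the temp file in the destination directory so the final rename
    # stays within one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, prefix=".distcp.tmp.")
    try:
        with os.fdopen(fd, "wb") as out, open(src_path, "rb") as src:
            while True:
                chunk = src.read(chunk_size)
                if not chunk:
                    break
                out.write(chunk)
        # Readers of dest_path see either the old state or the complete file,
        # never a partial copy.
        os.replace(tmp_path, dest_path)
    except BaseException:
        # A failed copy leaves nothing behind at dest_path.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return dest_path
```

If the copy fails part-way, only the temporary file is affected and it is 
cleaned up, so nothing incomplete ever appears under the destination name.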

Distcp with -direct does no renames, and you don't get the incomplete uploads, 
so I don't think there's any reason to put effort in here. If someone were to, 
they should look at the multipart upload API of Hadoop 3.3 and its ability to 
upload different blocks in parallel and coalesce them at the end.
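
To make the "upload blocks in parallel, coalesce at the end" idea concrete, 
here is a small sketch using an in-memory stand-in for an object store. The 
class InMemoryMultipartStore and the function parallel_upload are hypothetical 
names invented for this example; this is not the Hadoop multipart upload API, 
only the general shape of the technique.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class InMemoryMultipartStore:
    """Hypothetical stand-in for an object store with multipart uploads."""

    def __init__(self):
        self.objects = {}    # completed, visible objects: key -> bytes
        self._pending = {}   # upload id -> {part number: bytes}
        self._lock = threading.Lock()
        self._next_id = 0

    def initiate(self, key):
        with self._lock:
            self._next_id += 1
            upload_id = (key, self._next_id)
            self._pending[upload_id] = {}
        return upload_id

    def put_part(self, upload_id, part_number, data):
        with self._lock:
            self._pending[upload_id][part_number] = data

    def complete(self, upload_id):
        key = upload_id[0]
        with self._lock:
            parts = self._pending.pop(upload_id)
            # Coalesce the parts in part-number order; only now does the
            # object become visible under its key.
            self.objects[key] = b"".join(parts[n] for n in sorted(parts))


def parallel_upload(store, key, data, block_size, workers=4):
    """Upload data as blocks in parallel, then complete the upload."""
    upload_id = store.initiate(key)
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(store.put_part, upload_id, n, block)
            for n, block in enumerate(blocks, start=1)
        ]
        for future in futures:
            future.result()  # re-raise any part failure before completing
    store.complete(upload_id)
    return len(blocks)
```

The point of the sketch is that parts can arrive in any order and nothing is 
visible under the key until the final complete call, so no rename is needed 
to hide an incomplete upload.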

A key Distcp limitation is that you can't use change detection between source 
and dest if the two stores have different checksum algorithms/values; something 
to track the values there across jobs would be good.
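
One possible shape for such tracking, purely as a sketch: record the source 
checksum seen at copy time in a manifest, and on the next job compare the 
source's current checksum against the manifest rather than against the 
destination store. The manifest format (a JSON map of path to checksum string) 
and all function names below are assumptions made for this example, not 
anything Distcp provides.

```python
import json


def load_manifest(path):
    """Load the path -> checksum map from a previous job, or start empty."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    except FileNotFoundError:
        return {}


def save_manifest(path, manifest):
    """Persist the manifest so the next job can read it."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(manifest, f, indent=2, sort_keys=True)


def files_to_copy(source_checksums, manifest):
    """Return source paths that are new or whose checksum has changed.

    Both sides of the comparison are source-store checksums, so the
    destination's checksum algorithm never enters into it.
    """
    return sorted(
        path for path, checksum in source_checksums.items()
        if manifest.get(path) != checksum
    )


def record_copied(manifest, source_checksums, copied_paths):
    """Remember the source checksum of each file that was just copied."""
    for path in copied_paths:
        manifest[path] = source_checksums[path]
    return manifest
```

This only detects changes at the source; it assumes nothing modifies the 
destination between jobs, which a real implementation would have to verify.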

> Distcp make use of S3a Committers, be it magic or staging
> ---------------------------------------------------------
>
>                 Key: HADOOP-16428
>                 URL: https://issues.apache.org/jira/browse/HADOOP-16428
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs/s3
>    Affects Versions: 3.1.1
>            Reporter: Sahil Kaw
>            Priority: Minor
>             Fix For: 3.1.2
>
>
> Currently, I don't see Distcp make use of the S3a committers, be it Magic or 
> Staging, and I have noticed that most jobs which use the MapReduce framework 
> use the S3a committers, except Distcp. Distcp makes use of the 
> FileOutputCommitter even if S3a committer parameters are specified in 
> core-site.xml. Is this by design? If yes, can someone please explain the 
> reason for that? Are there any limitations or potential risks of using the 
> S3a committers with Distcp? I know there is a "-direct" option that can be 
> used with the FileOutputCommitter in order to avoid renaming while 
> committing for object stores. But if anyone can shed some light on the 
> current limitations of the S3a committers with Distcp, and the reason for 
> choosing the FileOutputCommitter for Distcp over the S3a committers, it 
> would be helpful. Thanks


