[ https://issues.apache.org/jira/browse/MAPREDUCE-7500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17931986#comment-17931986 ]
ASF GitHub Bot commented on MAPREDUCE-7500:
-------------------------------------------

steveloughran commented on PR #7425:
URL: https://github.com/apache/hadoop/pull/7425#issuecomment-2694616338

(sorry, accidentally closed it while trying to cancel my comment)

* I'm not going to accept any changes to the core committer as it is too risky to change
* happy to review changes to ManifestOutputCommitter
* It doesn't use the FileContext APIs; it uses FileSystem, with a special integration extension for Abfs where file renames can be given the etag of the source file. This delivers resilience on rename failures caused by transient overload/recovery of the abfs store.

> Support optimistic file renames in the commit protocol
> ------------------------------------------------------
>
>                 Key: MAPREDUCE-7500
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7500
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: client
>         Environment: The commit protocol in FileOutputCommitter now supports
> optimistic commits for files. This saves a FileSystem.getFileStatus call for
> cases where a conflict in the destination location is not expected at
> commit time (e.g. Spark). This feature is disabled by default. To enable it,
> set mapreduce.fileoutputcommitter.optimistic.file.commit.enabled=true.
>            Reporter: Rob Reeves
>            Priority: Minor
>              Labels: pull-request-available
>         Attachments: flamegraph_commit.png
>
>
> During a file commit, FileOutputCommitter assumes a file may be in the
> destination location and, if so, deletes it first. This means that for every
> file commit it calls FileSystem.getFileStatus for the destination. For the
> Spark use case, there will be nothing existing in the destination location
> in the expected case, so the getFileStatus call is wasted in all but
> exceptional and unexpected cases.
> The getFileStatus call can take significant time. When I profiled a commit in
> our environment (HDFS, intermittent latency issues) the
> FileSystem.getFileStatus call took 50% of the commit time. We have an
> aggressive auto-msync setting, but even when I disabled msync I saw the same
> behavior. I attached an example flame graph for the commit time
> (getFileStatus time is highlighted in pink).
> To avoid the time spent on getFileStatus, there should be an option to
> optimistically commit the file assuming there will be no conflict in the
> destination.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org
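
For readers of the archive, the difference between the check-first commit and the optimistic commit described in the quoted issue can be sketched roughly as below. This is only an illustration under stated assumptions: it uses java.nio.file as a local stand-in for Hadoop's FileSystem, the class and method names (OptimisticCommitSketch, commitWithCheck, commitOptimistically) are hypothetical, and it is not the code from PR #7425 or from FileOutputCommitter.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch: java.nio.file stands in for Hadoop's FileSystem.
public class OptimisticCommitSketch {

    // Check-first commit: always probes the destination before renaming,
    // which costs one extra metadata call per committed file.
    static void commitWithCheck(Path from, Path to) throws IOException {
        if (Files.exists(to)) {
            Files.delete(to);
        }
        Files.move(from, to);
    }

    // Optimistic commit: rename immediately and only handle the conflict
    // if the move reports that the destination already exists.
    // Returns true when the first attempt succeeded without a conflict.
    static boolean commitOptimistically(Path from, Path to) throws IOException {
        try {
            Files.move(from, to); // no REPLACE_EXISTING: fails if 'to' exists
            return true;
        } catch (FileAlreadyExistsException e) {
            Files.delete(to);     // exceptional path: clear the conflict, retry
            Files.move(from, to);
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("commit-sketch");
        Path task = Files.writeString(dir.resolve("task-output"), "data");
        Path dest = dir.resolve("part-00000");
        boolean firstTry = commitOptimistically(task, dest);
        System.out.println("committed on first attempt: " + firstTry);
    }
}
```

In the expected case (nothing at the destination) the optimistic variant makes a single rename call; the existence probe is paid for only when a conflict actually occurs.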