[ https://issues.apache.org/jira/browse/MAPREDUCE-7500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rob Reeves updated MAPREDUCE-7500:
----------------------------------
    Environment: The commit protocol in FileOutputCommitter now supports optimistic commits for files. This saves a FileSystem.getFileStatus call in cases where a conflict in the destination location is not expected at commit time (e.g. Spark). This feature is disabled by default. To enable it, set mapreduce.fileoutputcommitter.optimistic.file.commit.enabled=true.

> Support optimistic file renames in the commit protocol
> ------------------------------------------------------
>
>                 Key: MAPREDUCE-7500
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7500
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: client
>         Environment: The commit protocol in FileOutputCommitter now supports
> optimistic commits for files. This saves a FileSystem.getFileStatus call in
> cases where a conflict in the destination location is not expected at commit
> time (e.g. Spark). This feature is disabled by default. To enable it, set
> mapreduce.fileoutputcommitter.optimistic.file.commit.enabled=true.
>            Reporter: Rob Reeves
>            Priority: Minor
>              Labels: pull-request-available
>         Attachments: flamegraph_commit.png
>
>
> During a file commit, FileOutputCommitter assumes that a file may already
> exist in the destination location and, if so, deletes it first. This means
> that for every file commit it calls FileSystem.getFileStatus for the
> destination. For the Spark use case, nothing is expected to exist in the
> destination location, so the getFileStatus call is wasted in all but
> exceptional and unexpected cases.
> The getFileStatus call can take significant time. When I profiled a commit in
> our environment (HDFS, intermittent latency issues), the
> FileSystem.getFileStatus call took 50% of the commit time. We have an
> aggressive auto-msync setting, but even when I disabled msync I saw the same
> behavior. I attached an example flame graph for the commit time
> (getFileStatus time is highlighted in pink).
> To avoid the time spent on getFileStatus, there should be an option to
> optimistically commit the file, assuming there will be no conflict in the
> destination.
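
The difference between the two commit strategies described above can be sketched as follows. This is only an illustration under stated assumptions, not the code from the pull request: MiniFs and InMemoryFs are hypothetical stand-ins for the Hadoop FileSystem, exists() plays the role of FileSystem.getFileStatus, and rename() is assumed to return false when the destination already exists. Falling back to the check-and-delete path after a failed rename is likewise an assumption about how a conflict could be handled.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch only: none of these types are Hadoop classes. */
class OptimisticCommitSketch {

    /** Hypothetical minimal file system abstraction. */
    interface MiniFs {
        boolean exists(String path);             // stands in for getFileStatus
        boolean delete(String path);
        boolean rename(String src, String dst);  // assumed: false if dst exists
    }

    /** In-memory implementation that counts destination probes. */
    static final class InMemoryFs implements MiniFs {
        final Map<String, String> files = new HashMap<>();
        int existsCalls = 0;

        public boolean exists(String path) {
            existsCalls++;
            return files.containsKey(path);
        }

        public boolean delete(String path) {
            return files.remove(path) != null;
        }

        public boolean rename(String src, String dst) {
            if (!files.containsKey(src) || files.containsKey(dst)) {
                return false;
            }
            files.put(dst, files.remove(src));
            return true;
        }
    }

    /** Existing behavior: always probe the destination before renaming. */
    static void commitWithCheck(MiniFs fs, String src, String dst) {
        if (fs.exists(dst) && !fs.delete(dst)) {
            throw new IllegalStateException("Could not delete " + dst);
        }
        if (!fs.rename(src, dst)) {
            throw new IllegalStateException("Could not rename " + src + " to " + dst);
        }
    }

    /** Optimistic variant: rename first, probe only if the rename fails. */
    static void commitOptimistically(MiniFs fs, String src, String dst) {
        if (fs.rename(src, dst)) {
            return; // expected case: no conflict, no probe issued
        }
        commitWithCheck(fs, src, dst); // exceptional case: fall back
    }

    public static void main(String[] args) {
        InMemoryFs fs = new InMemoryFs();
        fs.files.put("tmp/part-0", "data");
        commitOptimistically(fs, "tmp/part-0", "out/part-0");
        System.out.println("destination probes: " + fs.existsCalls);
    }
}
```

In the expected no-conflict case the optimistic path issues no destination probe at all, which is the saving the issue is after. The probe is paid only when the rename reports a failure.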