[ https://issues.apache.org/jira/browse/MAPREDUCE-7500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Rob Reeves updated MAPREDUCE-7500:
----------------------------------
    Description:
During a file commit, FileOutputCommitter assumes a file may already exist at the destination and, if so, deletes it first. This means that for every file commit it calls FileSystem.getFileStatus for the destination. For the Spark use case, nothing is expected to exist at the destination, so the getFileStatus call is wasted in all but exceptional and unexpected cases.

The getFileStatus call can take significant time. When I profiled a commit in our environment (HDFS, intermittent latency issues), the FileSystem.getFileStatus call took 50% of the commit time. We have an aggressive auto-msync setting, but even when I disabled msync I saw the same behavior. I attached an example flame graph for the commit time.

To avoid the time spent on getFileStatus, there should be an option to optimistically commit the file, assuming there will be no conflict in the destination.

  was:
During a commit in FileOutputCommitter, every file commit checks whether a file or directory exists in the destination and, if so, deletes it before the rename. The FileSystem.getFileStatus call can take a significant amount of the total commit time. However, the happy path is that no file exists in the destination, so the getFileStatus call is wasted time. The commit protocol can avoid this cost by optimistically assuming there is no file in the destination and only attempting to delete it if the rename fails. In our HDFS environment this change reduced the commit time by 70%.
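The "rename first, delete only on failure" idea described above can be sketched roughly as follows. This is a minimal illustration of the control flow only, not the FileOutputCommitter change itself: it uses java.nio.file on a local file system as a stand-in for Hadoop's FileSystem, and the class name OptimisticCommit and method commitFile are hypothetical. Hadoop's FileSystem.rename reports failure through a boolean return value rather than FileAlreadyExistsException, so real code would detect the conflict differently, and the local java.nio implementation may still check the target internally.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

public class OptimisticCommit {

    /**
     * Moves src to dst without checking dst beforehand (hypothetical helper).
     * Returns true if the first, optimistic move succeeded, and false if the
     * destination had to be deleted and the move retried.
     */
    public static boolean commitFile(Path src, Path dst) throws IOException {
        try {
            // Happy path: nothing exists at dst, so no status check is issued here.
            Files.move(src, dst);
            return true;
        } catch (FileAlreadyExistsException e) {
            // Exceptional path: clear the conflicting entry, then retry once.
            // Files.delete handles a regular file or an empty directory only.
            Files.delete(dst);
            Files.move(src, dst);
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("commit-demo");
        Path dst = dir.resolve("part-00000");
        Path first = Files.write(dir.resolve("attempt-1"),
                "first".getBytes(StandardCharsets.UTF_8));
        Path second = Files.write(dir.resolve("attempt-2"),
                "second".getBytes(StandardCharsets.UTF_8));

        System.out.println(commitFile(first, dst));   // destination was empty
        System.out.println(commitFile(second, dst));  // destination already existed
        System.out.println(new String(Files.readAllBytes(dst), StandardCharsets.UTF_8));
    }
}
```

In this sketch the expected case pays for a single move, and the delete-and-retry cost is incurred only when a conflict actually occurs.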
> Support optimistic file renames in the commit protocol
> ------------------------------------------------------
>
>                 Key: MAPREDUCE-7500
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7500
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: client
>            Reporter: Rob Reeves
>            Priority: Minor
>         Attachments: flamegraph_commit.png
>
>
> During a file commit, FileOutputCommitter assumes a file may already exist at
> the destination and, if so, deletes it first. This means that for every file
> commit it calls FileSystem.getFileStatus for the destination. For the Spark
> use case, nothing is expected to exist at the destination, so the
> getFileStatus call is wasted in all but exceptional and unexpected cases.
> The getFileStatus call can take significant time. When I profiled a commit in
> our environment (HDFS, intermittent latency issues), the
> FileSystem.getFileStatus call took 50% of the commit time. We have an
> aggressive auto-msync setting, but even when I disabled msync I saw the same
> behavior. I attached an example flame graph for the commit time.
> To avoid the time spent on getFileStatus, there should be an option to
> optimistically commit the file, assuming there will be no conflict in the
> destination.