[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rob Reeves updated MAPREDUCE-7500:
----------------------------------
    Description: 
During a file commit, FileOutputCommitter assumes a file may already exist at 
the destination and, if so, deletes it first. This means that for every file 
commit it calls FileSystem.getFileStatus for the destination. For the Spark use 
case, nothing is expected to exist at the destination, so the getFileStatus 
call is wasted in all but exceptional and unexpected cases.

The getFileStatus call can take significant time. When I profiled a commit in 
our environment (HDFS, intermittent latency issues), the 
FileSystem.getFileStatus call took 50% of the commit time. We have an 
aggressive auto-msync setting, but even when I disabled msync I saw the same 
behavior. I attached an example flame graph for the commit time.

To avoid the time spent on getFileStatus, there should be an option to 
optimistically commit the file assuming there will be no conflict in the 
destination.

  was: During a commit in FileOutputCommitter, every file commit checks whether 
a file or directory exists in the destination and, if so, deletes it before the 
rename. The FileSystem.getFileStatus call can take a significant amount of the 
total commit time. However, the happy path is that no file exists in the 
destination, so the getFileStatus call is wasted time. The commit protocol can 
avoid this time by optimistically assuming there is no file in the destination 
and only attempting to delete it if the rename fails. In our HDFS environment 
this change reduced the commit time by 70%.
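
The difference between the check-then-rename behavior and the proposed 
optimistic rename can be sketched roughly as follows. This is only an 
illustration under stated assumptions, not the actual patch: it uses 
java.nio.file as a stand-in for Hadoop's FileSystem, handles plain files only, 
and the class and method names (OptimisticCommit, commitWithCheck, 
commitOptimistically) are hypothetical. Hadoop's FileSystem.rename reports 
failure through a boolean return value rather than an exception, so real 
committer code would detect the conflict differently.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Illustrative only: java.nio.file stands in for Hadoop's FileSystem. */
final class OptimisticCommit {

    /** Check-then-rename: always probes the destination before moving. */
    static void commitWithCheck(Path src, Path dst) throws IOException {
        // Extra metadata lookup on every commit (the analogue of getFileStatus).
        if (Files.exists(dst)) {
            Files.delete(dst);
        }
        Files.move(src, dst);
    }

    /**
     * Optimistic rename: move first, clean up only when the move reports a conflict.
     * Returns true if the fast path succeeded, false if the fallback was needed.
     */
    static boolean commitOptimistically(Path src, Path dst) throws IOException {
        try {
            // Without REPLACE_EXISTING the move fails if dst already exists.
            Files.move(src, dst);
            return true;
        } catch (FileAlreadyExistsException e) {
            // Exceptional case: remove the stale destination and retry once.
            Files.delete(dst);
            Files.move(src, dst);
            return false;
        }
    }
}
```

In the expected case the optimistic variant issues a single move and no 
existence check; the delete happens only when the destination turns out to be 
occupied.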


> Support optimistic file renames in the commit protocol
> ------------------------------------------------------
>
>                 Key: MAPREDUCE-7500
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7500
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: client
>            Reporter: Rob Reeves
>            Priority: Minor
>         Attachments: flamegraph_commit.png
>


