[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rob Reeves updated MAPREDUCE-7500:
----------------------------------
    Description: 
During a file commit, FileOutputCommitter assumes a file may already exist at 
the destination and, if so, deletes it first. This means that for every file 
commit it calls FileSystem.getFileStatus on the destination. For the Spark use 
case, nothing is expected to exist at the destination, so the getFileStatus 
call is wasted in all but exceptional and unexpected cases.

The getFileStatus call can take significant time. When I profiled a commit in 
our environment (HDFS, intermittent latency issues), the 
FileSystem.getFileStatus call took 50% of the commit time. We have an 
aggressive auto-msync setting, but even when I disabled msync I saw the same 
behavior. I attached an example flame graph for the commit time (getFileStatus 
time is highlighted in pink).

To avoid the time spent on getFileStatus, there should be an option to 
optimistically commit the file assuming there will be no conflict in the 
destination.
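
To make the difference between the two strategies concrete, here is a minimal 
sketch that uses java.nio.file on a local filesystem as a stand-in. It is not 
the FileOutputCommitter code and does not use the Hadoop FileSystem API; the 
class and method names (OptimisticCommitSketch, commitWithCheck, 
commitOptimistic) are hypothetical. A real change would have to map the 
conflict handling onto the rename semantics of the target FileSystem, which 
this sketch does not attempt.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical illustration only; a local-filesystem analogy, not Hadoop code. */
public class OptimisticCommitSketch {

    /** Conservative commit: always probe the destination before renaming. */
    static void commitWithCheck(Path src, Path dst) throws IOException {
        // Stand-in for the getFileStatus probe: this cost is paid on every commit,
        // even though the destination is normally absent.
        if (Files.exists(dst)) {
            Files.delete(dst);
        }
        Files.move(src, dst);
    }

    /** Optimistic commit: rename first, handle a conflict only if one occurs. */
    static void commitOptimistic(Path src, Path dst) throws IOException {
        try {
            // Without REPLACE_EXISTING, Files.move fails if dst already exists.
            Files.move(src, dst);
        } catch (FileAlreadyExistsException e) {
            // Exceptional case: fall back to delete-then-rename.
            // Assumes dst is a regular file, not a non-empty directory.
            Files.delete(dst);
            Files.move(src, dst);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("commit-sketch");
        Path src = Files.writeString(dir.resolve("task-output"), "new");
        Path dst = dir.resolve("final-output");

        // Expected case: nothing at the destination, so no probe is needed.
        commitOptimistic(src, dst);
        System.out.println(Files.readString(dst));
    }
}
```

In the expected case the optimistic path issues a single rename and no 
existence probe; the extra delete and retry are only paid when a file really 
is in the way.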

  was:
During a file commit in FileOutputCommitter, it assumes a file may be in the 
destination location and if so will delete it first. This means for every file 
commit is calls FileSystem.getFileStatus for the destination. For the Spark use 
case, there will be nothing existing in the destination location for the 
expected case so the getFileStatus call is wasted in all, but exceptional and 
unexpected cases.

The getFileStatus call can take significant time. When I profiled a commit in 
our environment (HDFS, intermittent latency issues) the 
FileSystem.getFilestatus call takes 50% of the commit time. We have an 
aggressive auto-msync setting, but even when I disabled msync I saw the same 
behavior. I attached an example flame graph for the commit time.

To avoid the time spent on getFileStatus, there should be an option to 
optimistically commit the file assuming there will be no conflict in the 
destination.


> Support optimistic file renames in the commit protocol
> ------------------------------------------------------
>
>                 Key: MAPREDUCE-7500
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7500
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: client
>            Reporter: Rob Reeves
>            Priority: Minor
>         Attachments: flamegraph_commit.png
>
>
> During a file commit, FileOutputCommitter assumes a file may already exist at 
> the destination and, if so, deletes it first. This means that for every file 
> commit it calls FileSystem.getFileStatus on the destination. For the Spark 
> use case, nothing is expected to exist at the destination, so the 
> getFileStatus call is wasted in all but exceptional and unexpected cases.
>
> The getFileStatus call can take significant time. When I profiled a commit in 
> our environment (HDFS, intermittent latency issues), the 
> FileSystem.getFileStatus call took 50% of the commit time. We have an 
> aggressive auto-msync setting, but even when I disabled msync I saw the same 
> behavior. I attached an example flame graph for the commit time 
> (getFileStatus time is highlighted in pink).
>
> To avoid the time spent on getFileStatus, there should be an option to 
> optimistically commit the file assuming there will be no conflict in the 
> destination.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org
