[
https://issues.apache.org/jira/browse/MAPREDUCE-7500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rob Reeves updated MAPREDUCE-7500:
----------------------------------
Description:
During a file commit, FileOutputCommitter assumes a file may already exist at
the destination location and, if so, deletes it first. This means that for
every file commit it calls FileSystem.getFileStatus for the destination. For the
Spark use case, nothing is expected to exist at the destination, so the
getFileStatus call is wasted in all but exceptional and unexpected cases.
The getFileStatus call can take significant time. When I profiled a commit in
our environment (HDFS, intermittent latency issues), the
FileSystem.getFileStatus call took 50% of the commit time. We have an
aggressive auto-msync setting, but even when I disabled msync I saw the same
behavior. I attached an example flame graph for the commit time (getFileStatus
time is highlighted in pink).
To avoid the time spent on getFileStatus, there should be an option to
optimistically commit the file assuming there will be no conflict in the
destination.
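
The difference between the current check-first behavior and the proposed
optimistic behavior can be sketched as follows. This is only an illustration
of the control flow, not the FileOutputCommitter code or a proposed patch: it
uses java.nio.file on a local filesystem as a stand-in for Hadoop's FileSystem
API, and the class and method names are hypothetical. Hadoop's
FileSystem.rename returns a boolean rather than throwing
FileAlreadyExistsException, so conflict detection in a real change would look
different.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative only: java.nio.file stands in for Hadoop's FileSystem API.
class OptimisticCommitSketch {

    // Check-first commit: always probes the destination before renaming,
    // analogous to calling getFileStatus on every file commit.
    static void commitWithCheck(Path src, Path dst) throws IOException {
        if (Files.exists(dst)) {
            Files.delete(dst);
        }
        Files.move(src, dst);
    }

    // Optimistic commit: rename immediately and clean up only on conflict.
    // Returns true if the rename succeeded without finding an existing file.
    static boolean commitOptimistically(Path src, Path dst) throws IOException {
        try {
            Files.move(src, dst); // without options, fails if dst exists
            return true;
        } catch (FileAlreadyExistsException e) {
            // Exceptional case: remove the existing file and retry once.
            Files.delete(dst);
            Files.move(src, dst);
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("commit-sketch");
        Path src = Files.write(dir.resolve("task-output"), "data".getBytes());
        Path dst = dir.resolve("part-00000");
        System.out.println("fast path: " + commitOptimistically(src, dst));
    }
}
```

In the expected case the optimistic variant issues a single rename and no
separate existence check; the delete-and-retry path only runs when something
is already at the destination.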
was:
During a file commit, FileOutputCommitter assumes a file may already exist at
the destination location and, if so, deletes it first. This means that for
every file commit it calls FileSystem.getFileStatus for the destination. For the
Spark use case, nothing is expected to exist at the destination, so the
getFileStatus call is wasted in all but exceptional and unexpected cases.
The getFileStatus call can take significant time. When I profiled a commit in
our environment (HDFS, intermittent latency issues) the
FileSystem.getFilestatus call takes 50% of the commit time. We have an
aggressive auto-msync setting, but even when I disabled msync I saw the same
behavior. I attached an example flame graph for the commit time.
To avoid the time spent on getFileStatus, there should be an option to
optimistically commit the file assuming there will be no conflict in the
destination.
> Support optimistic file renames in the commit protocol
> ------------------------------------------------------
>
> Key: MAPREDUCE-7500
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7500
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Components: client
> Reporter: Rob Reeves
> Priority: Minor
> Attachments: flamegraph_commit.png