[ https://issues.apache.org/jira/browse/MAPREDUCE-7500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17929234#comment-17929234 ]
ASF GitHub Bot commented on MAPREDUCE-7500:
-------------------------------------------

robreeves opened a new pull request, #7425:
URL: https://github.com/apache/hadoop/pull/7425

   <!--
     Thanks for sending a pull request!
       1. If this is your first time, please read our contributor guidelines: https://cwiki.apache.org/confluence/display/HADOOP/How+To+Contribute
       2. Make sure your PR title starts with JIRA issue id, e.g., 'HADOOP-17799. Your PR title ...'.
   -->

   ### Description of PR
   This PR adds a new feature that commits files optimistically, that is, it assumes there is no conflicting file or directory in the destination, in order to avoid a `FileSystem.getFileStatus` RPC. The default behavior is unchanged. To use this feature, set `mapreduce.fileoutputcommitter.optimistic.file.commit.enabled=true`.

   This is useful for cases like Spark, where no destination conflict is expected and the time spent on the `FileSystem.getFileStatus` RPC is wasted.

   TODO add more details.

   ### How was this patch tested?
   TODO add unit tests

   I tested the performance of the changes using Spark writing to HDFS for partitioned and non-partitioned datasets. TODO add more details.

   ### For code changes:

   - [x] Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
   - [ ] Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
   - [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)?
   - [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, `NOTICE-binary` files?
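
The difference between the default commit path and the optimistic one can be illustrated with a small, self-contained sketch. This is not the code from the pull request: `CountingFs` is a hypothetical in-memory stand-in for a file system, `commitWithCheck` and `commitOptimistic` are illustrative names, and the assumption that a rename onto an existing destination fails without overwriting is a simplification made for this example.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch only: CountingFs is a hypothetical in-memory stand-in,
// not Hadoop's FileSystem, and these methods are not the PR's implementation.
class OptimisticCommitSketch {

    /** Toy file system that counts simulated metadata round trips. */
    static class CountingFs {
        final Set<String> paths = new HashSet<>();
        int statusCalls = 0;  // stands in for FileSystem.getFileStatus RPCs
        int deleteCalls = 0;
        int renameCalls = 0;

        void create(String path) {
            paths.add(path);
        }

        boolean exists(String path) {
            statusCalls++;
            return paths.contains(path);
        }

        boolean delete(String path) {
            deleteCalls++;
            return paths.remove(path);
        }

        /** Assumed semantics: returns false, without overwriting, if src is missing or dst exists. */
        boolean rename(String src, String dst) {
            renameCalls++;
            if (!paths.contains(src) || paths.contains(dst)) {
                return false;
            }
            paths.remove(src);
            paths.add(dst);
            return true;
        }
    }

    /** Check-then-rename: probe the destination, clear it if present, then rename. */
    static boolean commitWithCheck(CountingFs fs, String src, String dst) {
        if (fs.exists(dst)) {
            fs.delete(dst);
        }
        return fs.rename(src, dst);
    }

    /** Optimistic: rename directly; a conflict surfaces as a failed rename. */
    static boolean commitOptimistic(CountingFs fs, String src, String dst) {
        return fs.rename(src, dst);
    }

    public static void main(String[] args) {
        CountingFs checked = new CountingFs();
        checked.create("tmp/part-0");
        commitWithCheck(checked, "tmp/part-0", "out/part-0");

        CountingFs optimistic = new CountingFs();
        optimistic.create("tmp/part-0");
        commitOptimistic(optimistic, "tmp/part-0", "out/part-0");

        System.out.println("checked:    status=" + checked.statusCalls
                + " rename=" + checked.renameCalls);
        System.out.println("optimistic: status=" + optimistic.statusCalls
                + " rename=" + optimistic.renameCalls);
    }
}
```

In the no-conflict case the optimistic path issues one call per file instead of two. How the actual patch handles a rename that fails because of a conflict is not described here, so the sketch simply returns `false` in that case.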
> Support optimistic file renames in the commit protocol
> ------------------------------------------------------
>
>                 Key: MAPREDUCE-7500
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7500
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: client
>            Reporter: Rob Reeves
>            Priority: Minor
>         Attachments: flamegraph_commit.png
>
>
> During a file commit, FileOutputCommitter assumes a file may already exist at
> the destination location and, if so, deletes it first. This means that for
> every file commit it calls FileSystem.getFileStatus on the destination. For
> the Spark use case, nothing is expected to exist at the destination, so the
> getFileStatus call is wasted in all but exceptional and unexpected cases.
> The getFileStatus call can take significant time. When I profiled a commit in
> our environment (HDFS, intermittent latency issues), the
> FileSystem.getFileStatus call took 50% of the commit time. We have an
> aggressive auto-msync setting, but even when I disabled msync I saw the same
> behavior. I attached an example flame graph for the commit time
> (getFileStatus time is highlighted in pink).
> To avoid the time spent on getFileStatus, there should be an option to
> optimistically commit the file, assuming there will be no conflict in the
> destination.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org