[
https://issues.apache.org/jira/browse/MAPREDUCE-7474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17835505#comment-17835505
]
ASF GitHub Bot commented on MAPREDUCE-7474:
-------------------------------------------
steveloughran opened a new pull request, #6716:
URL: https://github.com/apache/hadoop/pull/6716
Improve the resilience of the task commit save and rename operations with retries.
* Retries of save():
5 attempts, with a 500 millisecond sleep between them. Not currently configurable.
Open issue: should this be made configurable?
* Split delete(path, recursive) into deleteFile and rmdir for separate
statistics.
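
For illustration only, the fixed retry policy described above (5 attempts, 500 millisecond sleep, no configuration) might look roughly like the sketch below. `SaveRetrier`, `IOOperation` and `retry` are hypothetical names chosen for this example, not identifiers from the patch, and the committer's real code may differ.

```java
import java.io.IOException;
import java.io.InterruptedIOException;

/** Hypothetical helper for illustration; not the manifest committer's actual code. */
final class SaveRetrier {
  // Values taken from the PR description: 5 attempts, 500 ms between them.
  static final int ATTEMPTS = 5;
  static final long SLEEP_MILLIS = 500;

  /** An operation which may fail with an IOException. */
  interface IOOperation<T> {
    T apply() throws IOException;
  }

  /**
   * Invoke the operation up to {@code attempts} times, sleeping between
   * failed attempts. The last IOException is rethrown if every attempt fails.
   */
  static <T> T retry(IOOperation<T> operation, int attempts, long sleepMillis)
      throws IOException {
    if (attempts < 1) {
      throw new IllegalArgumentException("attempts must be >= 1");
    }
    IOException last = null;
    for (int attempt = 1; attempt <= attempts; attempt++) {
      try {
        return operation.apply();
      } catch (IOException e) {
        last = e;
        if (attempt < attempts) {
          try {
            Thread.sleep(sleepMillis);
          } catch (InterruptedException ie) {
            // Restore the interrupt flag and stop retrying.
            Thread.currentThread().interrupt();
            InterruptedIOException iioe =
                new InterruptedIOException("interrupted while waiting to retry");
            iioe.initCause(ie);
            throw iioe;
          }
        }
      }
    }
    throw last;
  }
}
```

A caller would wrap the save in something like `SaveRetrier.retry(() -> doSave(), SaveRetrier.ATTEMPTS, SaveRetrier.SLEEP_MILLIS)`, where `doSave()` stands in for whatever performs the actual write.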
The test simulation is expanded to:
* Support testing of recovery through a countdown of the number of calls to fail.
* Simulate timeouts before *and after* rename calls.
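
As a rough sketch of the two simulation features listed above, the in-memory store below fails a configurable number of rename calls, either before the rename takes effect or after it has already succeeded. `SimulatedStore`, `FailurePoint` and `failNext` are hypothetical names for this example; the test classes in the patch are not reproduced here.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical in-memory store with fault injection; for illustration only. */
final class SimulatedStore {
  enum FailurePoint { NONE, BEFORE_RENAME, AFTER_RENAME }

  private final Set<String> paths = new HashSet<>();
  private FailurePoint failurePoint = FailurePoint.NONE;
  // Countdown of rename calls which are still to fail.
  private int failuresRemaining;

  void create(String path) {
    paths.add(path);
  }

  boolean exists(String path) {
    return paths.contains(path);
  }

  /** Make the next {@code count} rename calls fail at the given point. */
  void failNext(FailurePoint point, int count) {
    failurePoint = point;
    failuresRemaining = count;
  }

  void rename(String source, String dest) throws IOException {
    boolean fail = failuresRemaining > 0;
    if (fail) {
      failuresRemaining--;
    }
    if (fail && failurePoint == FailurePoint.BEFORE_RENAME) {
      // The store is untouched: a simple retry of the rename is enough.
      throw new IOException("simulated timeout before rename");
    }
    if (!paths.remove(source)) {
      throw new FileNotFoundException(source);
    }
    paths.add(dest);
    if (fail && failurePoint == FailurePoint.AFTER_RENAME) {
      // The rename happened but the caller sees a failure, so recovery
      // has to cope with the source already being gone.
      throw new IOException("simulated timeout after rename");
    }
  }
}
```

The "after" case is the harder one to recover from: once the countdown reaches zero, a blind retry of the same rename would raise `FileNotFoundException`, because the source has already been moved.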
This is based on #6596 but skips the rate limiting logic spanning common and
azure;
instead it only contains changes in the manifest committer, which makes it easier to backport.
### How was this patch tested?
* manual run of the new tests
* full test suite left to yetus
* azure test run in progress
### For code changes:
- [X] Does the title of this PR start with the corresponding JIRA issue id
(e.g. 'HADOOP-17799. Your PR title ...')?
- [ ] Object storage: have the integration tests been executed and the
endpoint declared according to the connector-specific documentation?
- [ ] If adding new dependencies to the code, are these dependencies
licensed in a way that is compatible for inclusion under [ASF
2.0](http://www.apache.org/legal/resolved.html#category-a)?
- [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`,
`NOTICE-binary` files?
> [ABFS] Improve commit resilience and performance in Manifest Committer
> ----------------------------------------------------------------------
>
> Key: MAPREDUCE-7474
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7474
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: client
> Affects Versions: 3.4.0, 3.3.6
> Reporter: Steve Loughran
> Assignee: Steve Loughran
> Priority: Major
>
> * Manifest committer is not resilient to rename failures on task commit
> without HADOOP-18012 rename recovery enabled.
> * A large burst of delete calls was noted: are they needed?
> This relates to HADOOP-19093 but takes a more minimal approach, with the goal of
> making changes in the manifest committer only.
> Initial proposed changes:
> * retry recovery on task commit rename, always (repeat save, delete, rename)
> * audit delete use and see if it can be pruned
> * maybe: rate limit some IO internally, but not delegate to abfs