[ https://issues.apache.org/jira/browse/MAPREDUCE-7474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17835505#comment-17835505 ]
ASF GitHub Bot commented on MAPREDUCE-7474:
-------------------------------------------

steveloughran opened a new pull request, #6716:
URL: https://github.com/apache/hadoop/pull/6716

   Improve resilience of task commit save and rename operation with retries.

   * save() is retried for up to 5 attempts, with a 500 millisecond sleep between them.
     This is not configurable. Open issue: should we make this configurable?
   * Split delete(path, recursive) into deleteFile and rmdir for separate statistics.

   Test simulation expands to:
   * Support recovery through a countdown of calls to fail.
   * Simulate timeout before *and after* rename calls.

   This is based on #6596 but skips the rate limiting logic spanning common and azure;
   instead it only contains changes in the manifest committer, so it is easier to backport.

   ### How was this patch tested?

   * manual test of new tests
   * full test suite left to yetus
   * azure test run in progress.

   ### For code changes:

   - [X] Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
   - [ ] Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
   - [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)?
   - [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, `NOTICE-binary` files?

> [ABFS] Improve commit resilience and performance in Manifest Committer
> ----------------------------------------------------------------------
>
>                 Key: MAPREDUCE-7474
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7474
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: client
>    Affects Versions: 3.4.0, 3.3.6
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>            Priority: Major
>
> * Manifest committer is not resilient to rename failures on task commit
> without HADOOP-18012 rename recovery enabled.
> * large burst of delete calls noted: are they needed?
> relates to HADOOP-19093 but takes a more minimal approach with goal of
> changes in manifest committer only.
> Initial proposed changes
> * retry recovery on task commit rename, always (repeat save, delete, rename)
> * audit delete use and see if it can be pruned
> * maybe: rate limit some IO internally, but not delegate to abfs
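
For readers following the thread, the retry behaviour described in the pull request (save() retried for 5 attempts with a 500 millisecond sleep) and the test-side "countdown of calls to fail" can be sketched roughly as below. This is an illustration only, not the code in #6716: the class `SaveRetrySketch`, the `SaveOperation` interface, and the methods `saveWithRetries` and `failingFirst` are hypothetical names, and reading "5 attempts" as five attempts in total (rather than five retries after the first try) is an assumption.

```java
import java.io.IOException;
import java.io.InterruptedIOException;
import java.util.concurrent.atomic.AtomicInteger;

/** Illustrative sketch only; none of these names come from the manifest committer. */
public class SaveRetrySketch {

  /** Values quoted in the pull request description. */
  static final int SAVE_ATTEMPTS = 5;
  static final long SAVE_SLEEP_MILLIS = 500;

  /** Hypothetical stand-in for the save (and subsequent rename) step. */
  @FunctionalInterface
  interface SaveOperation {
    void save() throws IOException;
  }

  /**
   * Invoke the operation until it succeeds or the attempts are exhausted.
   * Sleeps between attempts, but not after the final failure.
   * @return the number of attempts that were made
   * @throws IOException the last failure if every attempt failed
   */
  static int saveWithRetries(SaveOperation op, int attempts, long sleepMillis)
      throws IOException {
    if (attempts < 1) {
      throw new IllegalArgumentException("attempts must be >= 1");
    }
    IOException last = null;
    for (int attempt = 1; attempt <= attempts; attempt++) {
      try {
        op.save();
        return attempt;
      } catch (IOException e) {
        last = e;
        if (attempt < attempts) {
          try {
            Thread.sleep(sleepMillis);
          } catch (InterruptedException ie) {
            // Restore the interrupt flag and give up rather than keep retrying.
            Thread.currentThread().interrupt();
            InterruptedIOException iioe =
                new InterruptedIOException("interrupted while waiting to retry");
            iioe.initCause(ie);
            throw iioe;
          }
        }
      }
    }
    throw last;
  }

  /**
   * Test helper: an operation which fails for the first {@code failures}
   * calls and succeeds afterwards (a "countdown of calls to fail").
   */
  static SaveOperation failingFirst(int failures) {
    AtomicInteger remaining = new AtomicInteger(failures);
    return () -> {
      if (remaining.getAndDecrement() > 0) {
        throw new IOException("simulated failure");
      }
    };
  }

  public static void main(String[] args) throws IOException {
    // A 1 ms sleep keeps the demo fast; the PR describes 500 ms.
    int attempts = saveWithRetries(failingFirst(2), SAVE_ATTEMPTS, 1);
    System.out.println("succeeded after " + attempts + " attempts");
  }
}
```

With two simulated failures the call succeeds on the third attempt. If the countdown is set to `SAVE_ATTEMPTS` or more, the last `IOException` is rethrown to the caller, which is where a configurable attempt count (the open question in the PR) would matter.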