[GitHub] [ozone] adoroszlai opened a new pull request, #4285: HDDS-7989. UnhealthyReplicationProcessor retries failure without delay


adoroszlai opened a new pull request, #4285:
URL: https://github.com/apache/ozone/pull/4285

## What changes were proposed in this pull request?

`UnhealthyReplicationProcessor#processAll` requeues any failed task. Because
the requeue happens inside the loop, such tasks are attempted again in the same
`processAll` call, without delay, before the loop exits. This can flood SCM
logs until the cause of the error is resolved.

The resulting log flood causes GitHub's environment to [run out of disk
space](https://github.com/adoroszlai/hadoop-ozone/actions/runs/4205417969/jobs/7297733162#step:5:1527)
within just a few minutes when testing EC reconstruction read (the test is
being added in HDDS-7982).

This PR proposes to collect failed container health results and requeue them
only after exiting the loop.
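
The idea can be illustrated with a minimal, self-contained sketch. It is not the actual Ozone code: the class `DeferredRequeueSketch`, the generic `processAll` signature, and the `Predicate` handler are hypothetical stand-ins chosen for illustration, assuming only that tasks are taken from a queue and that a failed task has to go back onto it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.function.Predicate;

/** Hypothetical stand-in for the processor; not the Ozone class. */
final class DeferredRequeueSketch {

  private DeferredRequeueSketch() {
  }

  /**
   * Drains the queue once. Items whose handler returns false or throws are
   * set aside and put back only after the loop has finished, so each item
   * is attempted at most once per call.
   *
   * @return the number of items that failed and were requeued
   */
  static <T> int processAll(Queue<T> queue, Predicate<T> handler) {
    List<T> failed = new ArrayList<>();
    T item;
    while ((item = queue.poll()) != null) {
      boolean ok;
      try {
        ok = handler.test(item);
      } catch (RuntimeException e) {
        // Treat an exception like any other failure.
        ok = false;
      }
      if (!ok) {
        // Requeueing here would make the loop pick the item up again
        // immediately; collect it instead.
        failed.add(item);
      }
    }
    // Requeue only after exiting the loop; the next call retries them.
    queue.addAll(failed);
    return failed.size();
  }
}
```

With this shape, a task that keeps failing is attempted once per `processAll` call, so the retry rate is bounded by how often the caller invokes `processAll` rather than by how fast the loop can spin.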

https://issues.apache.org/jira/browse/HDDS-7989

## How was this patch tested?

Added unit test.

Also verified together with HDDS-7982 (which uncovered the problem without
this fix):

https://github.com/adoroszlai/hadoop-ozone/actions/runs/4207471575/jobs/7302558782

Regular CI:
https://github.com/adoroszlai/hadoop-ozone/actions/runs/4207414175

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
