rmatharu commented on a change in pull request #1104: SAMZA-2266: Introduce a backoff when there are repeated failures for host-affinity allocations
URL: https://github.com/apache/samza/pull/1104#discussion_r309472980
##########
File path: docs/learn/documentation/versioned/jobs/samza-configurations.md
##########
@@ -292,6 +292,7 @@ Samza supports both standalone and clustered ([YARN](yarn-jobs.html)) [deploymen
 |--- |--- |--- |
 |cluster-manager.container.retry.count|8|If a container fails, it is automatically restarted by Samza. However, if a container keeps failing shortly after startup, that indicates a deeper problem, so we should kill the job rather than retrying indefinitely. This property determines the maximum number of times we are willing to restart a failed container in quick succession (the time period is configured with `cluster-manager.container.retry.window.ms`). Each container in the job is counted separately. If this property is set to 0, any failed container immediately causes the whole job to fail. If it is set to a negative number, there is no limit on the number of retries.|
 |cluster-manager.container.retry.window.ms|300000|This property determines how frequently a container is allowed to fail before we give up and fail the job. If the same container has failed more than `cluster-manager.container.retry.count` times, and the time between failures was less than this property `cluster-manager.container.retry.window.ms` (in milliseconds), then we fail the job. There is no limit to the number of times we will restart a container if the time between failures is greater than `cluster-manager.container.retry.window.ms`.|
+|cluster-manager.container.preferred-host.last.retry.delay.ms|360000|The delay of the last retry of `cluster-manager.container.retry.count` when `job.host-affinity.enabled` is true and the container is being requested to restart on a preferred host. This addresses the issue where there may be a delay when a preferred host is marked invalid and the container continuously attempts to restart and fail on the invalid preferred host. This property is useful to prevent the `cluster-manager.container.retry.count` from being exceeded too quickly for such scenarios.|

Review comment:
Nitpick: Perhaps the first sentence could be: "The delay added to the last retry for a failing container after all but one of `cluster-manager.container.retry.count` retries have been exhausted."

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

With regards,
Apache Git Services
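
For reference, the retry behaviour that the three table rows and the proposed wording describe can be sketched as below. This is only one reading of the documentation text quoted in the diff, not Samza's actual implementation: the function `decide`, the `Action` enum, and all parameter names are hypothetical, and the defaults simply mirror the values shown in the table.

```python
from enum import Enum


class Action(Enum):
    RETRY_NOW = "retry_now"
    RETRY_AFTER_DELAY = "retry_after_delay"
    FAIL_JOB = "fail_job"


def decide(
    recent_failures,               # failures of this container so far, including the current one
    ms_since_previous_failure,     # None if this is the container's first failure
    retry_count=8,                 # cluster-manager.container.retry.count
    retry_window_ms=300_000,       # cluster-manager.container.retry.window.ms
    last_retry_delay_ms=360_000,   # cluster-manager.container.preferred-host.last.retry.delay.ms
    host_affinity_enabled=False,   # job.host-affinity.enabled
    preferred_host_requested=False,
):
    """Return (action, delay_ms) for a failed container (hypothetical helper)."""
    # A negative retry count means there is no limit on retries.
    if retry_count < 0:
        return (Action.RETRY_NOW, 0)

    # Failures only accumulate when they happen in quick succession, i.e.
    # within the retry window; otherwise this failure starts a new streak.
    in_quick_succession = (
        ms_since_previous_failure is not None
        and ms_since_previous_failure < retry_window_ms
    )
    streak = recent_failures if in_quick_succession else 1

    # More failures than permitted retries: give up and fail the job.
    # With retry_count == 0 this is true on the very first failure.
    if streak > retry_count:
        return (Action.FAIL_JOB, 0)

    # All but one retry are exhausted, so this is the last retry. Back off
    # only when the restart is being requested on a preferred host.
    if streak == retry_count and host_affinity_enabled and preferred_host_requested:
        return (Action.RETRY_AFTER_DELAY, last_retry_delay_ms)

    return (Action.RETRY_NOW, 0)
```

Under this reading, the eighth failure in quick succession with host affinity enabled yields a delayed retry, whereas the same failure without host affinity is retried immediately, which is the distinction the proposed first sentence tries to capture.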
