dnishimura commented on a change in pull request #1156: SAMZA-2323: Provide option allow single containers to fail without failing the job
URL: https://github.com/apache/samza/pull/1156#discussion_r326292459
########## File path: docs/learn/documentation/versioned/jobs/samza-configurations.md ##########

@@ -296,6 +296,7 @@ Samza supports both standalone and clustered ([YARN](yarn-jobs.html)) [deploymen
 |cluster-manager.container.retry.count|8|If a container fails, it is automatically restarted by Samza. However, if a container keeps failing shortly after startup, that indicates a deeper problem, so we should kill the job rather than retrying indefinitely. This property determines the maximum number of times we are willing to restart a failed container in quick succession (the time period is configured with `cluster-manager.container.retry.window.ms`). Each container in the job is counted separately. If this property is set to 0, any failed container immediately causes the whole job to fail. If it is set to a negative number, there is no limit on the number of retries.|
 |cluster-manager.container.retry.window.ms|300000|This property determines how frequently a container is allowed to fail before we give up and fail the job. If the same container has failed more than `cluster-manager.container.retry.count` times, and the time between failures was less than `cluster-manager.container.retry.window.ms` (in milliseconds), then we fail the job. There is no limit to the number of times we will restart a container if the time between failures is greater than `cluster-manager.container.retry.window.ms`.|
 |cluster-manager.container.preferred-host.last.retry.delay.ms|360000|The delay added to the last retry for a failing container after all but one of `cluster-manager.container.retry.count` retries have been exhausted. The delay is only added when `job.host-affinity.enabled` is true and the retried request is for a preferred host. This addresses the issue where a preferred host is marked invalid and the container continuously attempts to restart and fails on the invalid preferred host. This property is useful to prevent the `cluster-manager.container.retry.count` from being exceeded too quickly in such scenarios.|
+|cluster-manager.container.fail.job.after.retries|true|This configuration sets the behavior of the job after all `cluster-manager.container.retry.count` retries are exhausted on a single container. If set to true, the whole job fails if any container fails after its last retry. If set to false, the job continues to run without the failed container. Typical use cases for setting this to false are debugging the cluster manager when containers fail unexpectedly, and allowing the remaining healthy containers to keep running so that lag does not accumulate across all containers. Samza job operators should be diligent in monitoring the `job-healthy` and `failed-containers` metrics when setting this configuration to false. A full restart of the job is required if another attempt to restart the container is needed after failure.|

Review comment:
   Will change.
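For readers of the diff above, the retry-related properties could be combined in a job's `.properties` file roughly as follows. This is a sketch, not a recommendation: the values shown are the documented defaults except for `cluster-manager.container.fail.job.after.retries`, which is flipped to `false` to illustrate the new option being reviewed.

```properties
# Restart a failed container up to 8 times if failures occur
# within a 5-minute (300000 ms) window; beyond that, give up.
cluster-manager.container.retry.count=8
cluster-manager.container.retry.window.ms=300000

# With host affinity enabled, delay the final retry on a preferred
# host by 6 minutes so an invalid preferred host does not burn
# through the retry budget immediately.
job.host-affinity.enabled=true
cluster-manager.container.preferred-host.last.retry.delay.ms=360000

# New option from this PR: keep the rest of the job running even if
# one container exhausts its retries. Monitor the job-healthy and
# failed-containers metrics when running with this set to false.
cluster-manager.container.fail.job.after.retries=false
```

Note that with `fail.job.after.retries=false`, bringing a dead container back requires a full restart of the job, per the table above.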
