dnishimura commented on a change in pull request #1156: SAMZA-2323: Provide option allow single containers to fail without failing the job
URL: https://github.com/apache/samza/pull/1156#discussion_r326292459
########## File path: docs/learn/documentation/versioned/jobs/samza-configurations.md ##########

@@ -296,6 +296,7 @@ Samza supports both standalone and clustered ([YARN](yarn-jobs.html)) [deploymen
 |cluster-manager.container.retry.count|8|If a container fails, it is automatically restarted by Samza. However, if a container keeps failing shortly after startup, that indicates a deeper problem, so we should kill the job rather than retrying indefinitely. This property determines the maximum number of times we are willing to restart a failed container in quick succession (the time period is configured with `cluster-manager.container.retry.window.ms`). Each container in the job is counted separately. If this property is set to 0, any failed container immediately causes the whole job to fail. If it is set to a negative number, there is no limit on the number of retries.|
 |cluster-manager.container.retry.window.ms|300000|This property determines how frequently a container is allowed to fail before we give up and fail the job. If the same container has failed more than `cluster-manager.container.retry.count` times, and the time between failures was less than `cluster-manager.container.retry.window.ms` (in milliseconds), then we fail the job. There is no limit to the number of times we will restart a container if the time between failures is greater than `cluster-manager.container.retry.window.ms`.|
 |cluster-manager.container.preferred-host.last.retry.delay.ms|360000|The delay added to the last retry for a failing container after all but one of `cluster-manager.container.retry.count` retries have been exhausted. The delay is only added when `job.host-affinity.enabled` is true and the retried request is for a preferred host. This addresses the issue where a preferred host is marked invalid and the container continuously attempts to restart and fails on the invalid preferred host. This property is useful to prevent the `cluster-manager.container.retry.count` from being exceeded too quickly in such scenarios.|
+|cluster-manager.container.fail.job.after.retries|true|This configuration sets the behavior of the job after all `cluster-manager.container.retry.count` retries are exhausted on a single container. If set to true, the whole job fails if any container fails after its last retry. If set to false, the job continues to run without the failed container. Typical use cases for setting this to false are debugging the cluster manager when containers fail unexpectedly, and allowing the remaining healthy containers to keep running so that lag does not accumulate across all containers. Samza job operators should be diligent in monitoring the `job-healthy` and `failed-containers` metrics when setting this configuration to false. A full restart of the job is required if another attempt to restart the container is needed after failure.|

Review comment:
   Will change.
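For readers of the diff above, the retry-related properties could be combined in a job's `.properties` file roughly as follows. This is a sketch, not a recommendation: the values shown are the documented defaults except for `cluster-manager.container.fail.job.after.retries`, which is flipped to `false` to illustrate the new option being reviewed.

```properties
# Restart a failed container up to 8 times if failures occur
# within a 5-minute (300000 ms) window; beyond that, give up.
cluster-manager.container.retry.count=8
cluster-manager.container.retry.window.ms=300000

# With host affinity enabled, delay the final retry on a preferred
# host by 6 minutes so an invalid preferred host does not burn
# through the retry budget immediately.
job.host-affinity.enabled=true
cluster-manager.container.preferred-host.last.retry.delay.ms=360000

# New option from this PR: keep the rest of the job running even if
# one container exhausts its retries. Monitor the job-healthy and
# failed-containers metrics when running with this set to false.
cluster-manager.container.fail.job.after.retries=false
```

Note that with `fail.job.after.retries=false`, bringing a dead container back requires a full restart of the job, per the table above.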
