rmatharu commented on a change in pull request #1104: SAMZA-2266: Introduce a 
backoff when there are repeated failures for host-affinity allocations
URL: https://github.com/apache/samza/pull/1104#discussion_r309472980
 
 

 ##########
 File path: docs/learn/documentation/versioned/jobs/samza-configurations.md
 ##########
 @@ -292,6 +292,7 @@ Samza supports both standalone and clustered 
([YARN](yarn-jobs.html)) [deploymen
 |--- |--- |--- |
 |cluster-manager.container.retry.count|8|If a container fails, it is 
automatically restarted by Samza. However, if a container keeps failing shortly 
after startup, that indicates a deeper problem, so we should kill the job 
rather than retrying indefinitely. This property determines the maximum number 
of times we are willing to restart a failed container in quick succession (the 
time period is configured with `cluster-manager.container.retry.window.ms`). 
Each container in the job is counted separately. If this property is set to 0, 
any failed container immediately causes the whole job to fail. If it is set to 
a negative number, there is no limit on the number of retries.|
 |cluster-manager.container.retry.window.ms|300000|This property determines how 
frequently a container is allowed to fail before we give up and fail the job. 
If the same container has failed more than 
`cluster-manager.container.retry.count` times, and the time between failures 
was less than this property `cluster-manager.container.retry.window.ms` (in 
milliseconds), then we fail the job. There is no limit to the number of times 
we will restart a container if the time between failures is greater than 
`cluster-manager.container.retry.window.ms`.|
+|cluster-manager.container.preferred-host.last.retry.delay.ms|360000|The delay of the last retry of `cluster-manager.container.retry.count` when `job.host-affinity.enabled` is true and the container is being requested to restart on a preferred host. This addresses the case where there is a lag before a preferred host is marked invalid, during which the container repeatedly restarts and fails on that invalid preferred host. This property prevents `cluster-manager.container.retry.count` from being exhausted too quickly in such scenarios.|
 
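Taken together, the three retry properties above might appear in a job config like this (an illustrative fragment using the documented default values; not taken from any particular Samza job):

```properties
# Restart a failed container at most 8 times in quick succession;
# "quick succession" means failures less than 5 minutes apart.
cluster-manager.container.retry.count=8
cluster-manager.container.retry.window.ms=300000

# With host affinity on, delay the final retry by 6 minutes so a container
# does not exhaust its retry budget on an invalid preferred host.
job.host-affinity.enabled=true
cluster-manager.container.preferred-host.last.retry.delay.ms=360000
```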
 Review comment:
   Nitpick: Perhaps the first sentence could be
   
   The delay added to the last retry for a failing container after all but one 
of `cluster-manager.container.retry.count` retries have been exhausted.
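To make the interaction of the three settings concrete, here is a hypothetical sketch of the decision logic. This is not Samza's actual implementation; the class, method, and field names (`ContainerRetryPolicy`, `onFailure`, `Action`) are invented for illustration, and it assumes the caller tracks how many failures have occurred inside the current retry window.

```java
// Hypothetical sketch of the retry decision described in the config table.
// Names are illustrative; this is not the Samza ClusterManager code.
public class ContainerRetryPolicy {
  private final int retryCount;        // cluster-manager.container.retry.count
  private final long retryWindowMs;    // cluster-manager.container.retry.window.ms
  private final long lastRetryDelayMs; // ...preferred-host.last.retry.delay.ms

  public ContainerRetryPolicy(int retryCount, long retryWindowMs, long lastRetryDelayMs) {
    this.retryCount = retryCount;
    this.retryWindowMs = retryWindowMs;
    this.lastRetryDelayMs = lastRetryDelayMs;
  }

  /**
   * Decide what to do after a container failure, given how many failures
   * have already been observed within the current retry window.
   */
  public Action onFailure(int failuresInWindow, boolean hostAffinityEnabled) {
    if (retryCount < 0) {
      return Action.retryAfter(0); // negative count: retry without limit
    }
    if (retryCount == 0 || failuresInWindow > retryCount) {
      return Action.failJob();     // retry budget exhausted within the window
    }
    // Delay only the final retry, so an invalid preferred host does not
    // burn through the whole retry budget before it is marked invalid.
    boolean lastRetry = failuresInWindow == retryCount;
    long delayMs = (hostAffinityEnabled && lastRetry) ? lastRetryDelayMs : 0;
    return Action.retryAfter(delayMs);
  }

  /** Outcome of a failure: retry after a delay, or fail the whole job. */
  public static final class Action {
    public final boolean retry;
    public final long delayMs;
    private Action(boolean retry, long delayMs) {
      this.retry = retry;
      this.delayMs = delayMs;
    }
    static Action retryAfter(long delayMs) { return new Action(true, delayMs); }
    static Action failJob() { return new Action(false, 0); }
  }
}
```

With the defaults (8 retries, 5-minute window, 6-minute last-retry delay), retries 1 through 7 are immediate, the 8th is delayed by 6 minutes, and a 9th failure inside the window fails the job.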

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services