[ https://issues.apache.org/jira/browse/SAMZA-2266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hai Lu updated SAMZA-2266:
--------------------------
Fix Version/s: 1.3
> Introduce a backoff when there are repeated failures for host-affinity
> allocations
> ----------------------------------------------------------------------------------
>
> Key: SAMZA-2266
> URL: https://issues.apache.org/jira/browse/SAMZA-2266
> Project: Samza
> Issue Type: Bug
> Reporter: Daniel Nishimura
> Assignee: Daniel Nishimura
> Priority: Major
> Fix For: 1.3
>
> Time Spent: 9h
> Remaining Estimate: 0h
>
> The issue is that we retry allocations for dead containers, and retry again
> on each subsequent failure, within a very small window of time (<1 min).
> We have observed that NodeManagers (NMs) take ~2 min to report themselves as
> unhealthy to the ResourceManager (RM). If a job has host-affinity enabled, we
> therefore allocate containers on the same unhealthy host multiple times and
> eventually kill the application.
> This ticket is to evaluate the feasibility of, and possibly implement, a fix
> that introduces a time backoff on retries of container allocation on the same
> host, so that we eventually get a different host once the unhealthy NM's
> status is updated.
> We may also want to look into abandoning host-affinity on the 8th attempt at
> restarting a container, so that we don't kill the entire job.
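
The two ideas in the description (a time backoff on same-host retries, and
giving up host-affinity after repeated failures) can be sketched roughly as
follows. This is a minimal illustration, not Samza's implementation: the class
name, the ANY_HOST marker, the 60 s base delay, the 300 s cap, and the
threshold of 3 attempts are all hypothetical values chosen so that the waits
outlast the ~2 min NM health-report delay mentioned above.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

# Hypothetical marker meaning "no host preference" (assumption, not a Samza constant).
ANY_HOST = "ANY_HOST"


@dataclass
class HostAffinityRetryPolicy:
    """Illustrative policy: exponential backoff, then drop host-affinity."""

    base_delay_s: float = 60.0       # assumed first backoff
    max_delay_s: float = 300.0       # assumed cap on any single backoff
    max_affinity_attempts: int = 3   # assumed failures tolerated on the same host
    failures: Dict[str, int] = field(default_factory=dict)

    def record_failure(self, container_id: str) -> None:
        # Count consecutive failures per container.
        self.failures[container_id] = self.failures.get(container_id, 0) + 1

    def record_success(self, container_id: str) -> None:
        # A healthy start resets the failure count.
        self.failures.pop(container_id, None)

    def next_request(self, container_id: str, last_host: str) -> Tuple[str, float]:
        """Return (preferred_host, delay_in_seconds) for the next allocation."""
        n = self.failures.get(container_id, 0)
        if n >= self.max_affinity_attempts:
            # Too many failures on this host: stop insisting on it.
            return ANY_HOST, 0.0
        if n == 0:
            return last_host, 0.0
        # Double the wait on each consecutive failure, up to the cap.
        delay = min(self.base_delay_s * 2 ** (n - 1), self.max_delay_s)
        return last_host, delay
```

With these assumed defaults, a container that keeps failing on host-1 is
retried there after 60 s and then 120 s, and the third failure releases the
host preference so the scheduler is free to place it elsewhere.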
--
This message was sent by Atlassian Jira
(v8.3.4#803005)