[ 
https://issues.apache.org/jira/browse/SAMZA-2266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hai Lu updated SAMZA-2266:
--------------------------
    Fix Version/s: 1.3

> Introduce a backoff when there are repeated failures for host-affinity 
> allocations
> ----------------------------------------------------------------------------------
>
>                 Key: SAMZA-2266
>                 URL: https://issues.apache.org/jira/browse/SAMZA-2266
>             Project: Samza
>          Issue Type: Bug
>            Reporter: Daniel Nishimura
>            Assignee: Daniel Nishimura
>            Priority: Major
>             Fix For: 1.3
>
>          Time Spent: 9h
>  Remaining Estimate: 0h
>
> The issue is that we retry allocation of dead containers (and retry again 
> on each subsequent failure) within a very small window of time (<1 min). 
> NMs have been observed to take ~2 min to report themselves as unhealthy to the RM.
> If a job has host-affinity enabled, this causes us to allocate containers 
> on the same unhealthy host multiple times, which eventually kills the application.
> This ticket is to evaluate the feasibility of, and possibly implement, a fix that 
> introduces a time backoff on retries of container allocation on the 
> same host, so that we eventually get a different host once the unhealthy NM's 
> status is updated.
> We may also want to look into abandoning host-affinity on 
> the 8th attempt of restarting a container, so we don't kill the entire job.
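
The backoff idea described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not Samza's actual implementation: the class name HostAffinityBackoff, its methods, and the delay values are all hypothetical. It doubles the wait before each retry on the same host up to a cap, and reports when the preferred host should be abandoned after a configured number of failures.

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper for illustration; not part of the Samza API. */
final class HostAffinityBackoff {
    private final Duration baseDelay;
    private final Duration maxDelay;
    private final int maxPreferredHostAttempts;
    // Consecutive container failures seen on each preferred host.
    private final Map<String, Integer> failuresByHost = new HashMap<>();

    HostAffinityBackoff(Duration baseDelay, Duration maxDelay, int maxPreferredHostAttempts) {
        this.baseDelay = baseDelay;
        this.maxDelay = maxDelay;
        this.maxPreferredHostAttempts = maxPreferredHostAttempts;
    }

    /** Records a failure on the host and returns how long to wait before retrying there. */
    Duration recordFailure(String host) {
        int failures = failuresByHost.merge(host, 1, Integer::sum);
        return delayFor(failures);
    }

    /** Exponential delay: base, 2x base, 4x base, ... capped at maxDelay. */
    Duration delayFor(int failures) {
        if (failures <= 0) {
            return Duration.ZERO;
        }
        long delay = baseDelay.toMillis();
        long cap = maxDelay.toMillis();
        // Stop doubling once the cap is reached, which also avoids overflow.
        for (int i = 1; i < failures && delay < cap; i++) {
            delay *= 2;
        }
        return Duration.ofMillis(Math.min(delay, cap));
    }

    /** False once the host has failed too often; the caller may then request any host. */
    boolean shouldUsePreferredHost(String host) {
        return failuresByHost.getOrDefault(host, 0) < maxPreferredHostAttempts;
    }

    /** Clears the failure history once a container starts successfully on the host. */
    void recordSuccess(String host) {
        failuresByHost.remove(host);
    }
}
```

With an assumed base delay of 30 seconds and a cap of 4 minutes, the third consecutive failure on a host already waits 2 minutes, which is about the time an NM is observed to need before it reports itself as unhealthy. Passing 8 as the attempt limit would correspond to abandoning host-affinity on the 8th restart attempt.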



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
