dnishimura opened a new pull request #1104: SAMZA-2266: Introduce a backoff 
when there are repeated failures for host-affinity allocations
URL: https://github.com/apache/samza/pull/1104
 
 
   **Motivation**
   For host-affinity enabled jobs, a bad physical host may not immediately be 
marked as invalid by the Resource Manager (RM). As a result, when the 
`HostAwareContainerAllocator` requests a preferred host, the RM invokes the 
`onResourceCompleted` callback even though the host can't be allocated. The 
error status passed to `onResourceCompleted` is equivalent to an application 
error, so the retry logic kicks in to restart the failed container. Adding 
delays to the retry logic prevents the job from failing prematurely (after 8 
retries) before the bad host is marked invalid.
   
   **Implementation notes**
   Added an exponential back-off with a max delay. Container allocation 
requests are put in a priority queue, with the priority determined by request 
type and request timestamp. For retries that have a delay, I set the request 
timestamp X into the future, where X is the calculated back-off.
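
   A minimal sketch of the idea described above, not the code in this PR: the class `RetryBackoffSketch`, its methods, and the delay values are hypothetical names and numbers chosen for illustration. It assumes that the consumer of the queue only takes a request once its timestamp has been reached, which is what lets a future timestamp act as a delay.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// Hypothetical sketch: class, field, and method names are illustrative,
// not taken from the Samza code base.
final class RetryBackoffSketch {

  // Pending allocation request; a lower typePriority value is served first.
  static final class Request {
    final String containerId;
    final int typePriority;
    final long requestTimestampMs;

    Request(String containerId, int typePriority, long requestTimestampMs) {
      this.containerId = containerId;
      this.typePriority = typePriority;
      this.requestTimestampMs = requestTimestampMs;
    }
  }

  // Order by type first, then by request timestamp (earliest first).
  static final Comparator<Request> ORDER =
      Comparator.comparingInt((Request r) -> r.typePriority)
          .thenComparingLong(r -> r.requestTimestampMs);

  // Delay before retry number retryCount (1-based):
  // initialDelayMs * 2^(retryCount - 1), capped at maxDelayMs.
  // Assumes 0 < initialDelayMs <= maxDelayMs.
  static long backoffMillis(int retryCount, long initialDelayMs, long maxDelayMs) {
    if (retryCount <= 0) {
      return 0L; // first attempt, no delay
    }
    long delay = initialDelayMs;
    for (int i = 1; i < retryCount && delay < maxDelayMs; i++) {
      delay *= 2;
    }
    return Math.min(delay, maxDelayMs);
  }

  // Enqueue a retry whose timestamp lies backoffMillis(...) in the future.
  static Request enqueueRetry(PriorityQueue<Request> queue, String containerId,
      int typePriority, long nowMs, int retryCount,
      long initialDelayMs, long maxDelayMs) {
    long delay = backoffMillis(retryCount, initialDelayMs, maxDelayMs);
    Request request = new Request(containerId, typePriority, nowMs + delay);
    queue.add(request);
    return request;
  }

  // Return the head request only once its timestamp has been reached;
  // otherwise return null and leave the queue untouched.
  static Request pollReady(PriorityQueue<Request> queue, long nowMs) {
    Request head = queue.peek();
    if (head == null || head.requestTimestampMs > nowMs) {
      return null;
    }
    return queue.poll();
  }
}
```

   With the illustrative values of a 1000 ms initial delay and a 60000 ms cap, retries 1, 2, and 3 wait 1 s, 2 s, and 4 s, and the cap applies from the 7th retry onward. Because the delay is encoded in the timestamp, a delayed retry sorts behind same-type requests that are due earlier, so no separate timer is needed.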
   
   **Testing**
   Unit tests, plus a manual test of a Samza job on a YARN cluster. I 
simulated the scenario by forcing an uncaught exception in a few containers so 
that they would fail.
   
   @rmatharu @abhishekshivanna and others please take a look
