GitHub user abhishekshivanna opened a pull request: https://github.com/apache/samza/pull/375
SAMZA-1506: Fix for robust ContainerHeartbeatMonitor exception handling.

The fix includes the following changes:
- Catch all exceptions inside the heartbeat thread, not just IOException.
- Add a time-based force kill when the heartbeat is invalid. This makes the monitor immune to threads that may keep the container stuck in the shutdown sequence. When the timeout occurs, System.exit(1) is called.
- Increase the number of retries for failed heartbeats from 3 to 6. This prevents short intermittent network failures from causing the containers to be invalidated.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/abhishekshivanna/samza container-heartbeat

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/samza/pull/375.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #375

----
commit 55145366b0a2e15b30665e88cead5f6bfd75ee2e
Author: Abhishek Shivanna <abhishek...@gmail.com>
Date:   2017-11-30T20:09:10Z

    SAMZA-1506: Fix for robust ContainerHeartbeatMonitor exception handling.

    The fix includes the following changes:
    - Catch all exceptions inside the heartbeat thread, not just IOException.
    - Add a time-based force kill when the heartbeat is invalid. This makes the monitor immune to threads that may keep the container stuck in the shutdown sequence. When the timeout occurs, System.exit(1) is called.
    - Increase the number of retries for failed heartbeats from 3 to 6. This prevents short intermittent network failures from causing the containers to be invalidated.

----
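The three changes described above can be sketched roughly as follows. This is an illustrative sketch only, not the code from the pull request: the class name HeartbeatMonitorSketch, its constructor parameters, and the injected heartbeatCall, onExpired, and forceKill callbacks are all hypothetical. In a real container the forceKill callback would presumably be something like `() -> System.exit(1)`; it is injected here so the behavior can be exercised without terminating the JVM.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch; names and structure are assumptions, not the Samza implementation.
class HeartbeatMonitorSketch {
    // The description raises the retry count for failed heartbeats from 3 to 6.
    static final int MAX_ATTEMPTS = 6;

    private final Callable<Boolean> heartbeatCall; // returns true while the container is still valid
    private final Runnable onExpired;              // starts the normal shutdown sequence
    private final Runnable forceKill;              // last resort, e.g. () -> System.exit(1)
    private final long shutdownTimeoutMs;
    private final AtomicBoolean expired = new AtomicBoolean(false);

    // Two daemon threads: one may block inside onExpired while the other
    // is still free to run the force-kill task.
    private final ScheduledExecutorService scheduler =
        Executors.newScheduledThreadPool(2, r -> {
            Thread t = new Thread(r, "heartbeat-monitor");
            t.setDaemon(true);
            return t;
        });

    HeartbeatMonitorSketch(Callable<Boolean> heartbeatCall, Runnable onExpired,
                           Runnable forceKill, long shutdownTimeoutMs) {
        this.heartbeatCall = heartbeatCall;
        this.onExpired = onExpired;
        this.forceKill = forceKill;
        this.shutdownTimeoutMs = shutdownTimeoutMs;
    }

    // Tries the heartbeat up to MAX_ATTEMPTS times; any exception counts as a failed attempt.
    boolean heartbeatValid() {
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                return heartbeatCall.call();
            } catch (Exception e) {
                // Catch every exception, not just IOException, then retry.
            }
        }
        return false;
    }

    // One monitoring pass. Never throws, because a periodic task that throws
    // is silently cancelled by ScheduledExecutorService.
    void checkOnce() {
        try {
            if (!heartbeatValid() && expired.compareAndSet(false, true)) {
                // Arm the force kill first so a stuck shutdown cannot prevent it.
                scheduler.schedule(forceKill, shutdownTimeoutMs, TimeUnit.MILLISECONDS);
                onExpired.run();
            }
        } catch (Exception e) {
            System.err.println("Heartbeat check failed: " + e);
        }
    }

    void start(long periodMs) {
        scheduler.scheduleAtFixedRate(this::checkOnce, 0, periodMs, TimeUnit.MILLISECONDS);
    }

    void stop() {
        scheduler.shutdownNow();
    }
}
```

Arming the force-kill timer before invoking the shutdown callback is a design choice of this sketch: if the callback hangs, the timer still fires on the second scheduler thread once the timeout elapses.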