Hi Everyone,

To fix the orphaned/leaked containers that are left behind when the YARN NodeManager crashes, I have created a SEP describing the design for a heartbeat between the running containers and the job coordinator:
https://cwiki.apache.org/confluence/display/SAMZA/SEP-3%3A+Heart-beat+mechanism+between+JobCoordinator+and+all+running+containers

Please take a look and provide feedback. I would also really appreciate help in designing a way to propagate the error up from SamzaContainer so that the container exits with a non-zero exit code.

Thanks,
Abhishek