Hello, Here are some thoughts about HA of Samza.
1. Failure detection The problem is, failure detection of container completely depends on YARN in Samza. YARN counts on Node Manager reporting container failures, however Node Manager could fail, too (like, if the machine failed, NM would fail). Node Manager failures can be detected through heartbeat by Resource Manager, but, by default it'll take 10 mins to confirm Node Manager failure. I think, that's OK with batch processing, but not stream processing. Configuring yarn failure confirm interval to 1s, result in an unstable yarn cluster(4 node in total). With 2s, all things works fine, but it takes 10s~20s to get lost container(machine shut down) back. Considering that testing stream task is very simple(stateless), the recovery time is relatively long. I am not an expert on YARN, I don't know why it, by default, takes such a long time to confirm node failure. To my understanding, YARN is something trying to be general, and it is not sufficient for stream processing framework. Extra effort should be done beyond YARN on failure detection in stream processing. 2. Task redeployment After Resource Manager informed Samza of container failure, Samza should apply for resources from YARN to redeploy failed tasks, which consumes time during recovery. And, recovery time is critical for HA in stream processing. I think, maintaining a few standby containers may eliminate this overhead on recovery time. Samza could deploy failed tasks on the standby containers than requesting from YARN. Hot standby containers, which is described in SAMZA-406(https://issues.apache.org/jira/browse/SAMZA-406), may help save recovery time, however it's costly(it doubles the resources needed). I'm wondering, what does these stuffs means to you, and how about the feasibility. By the way, I'm using Samza 0.7 . Thank you for reading. Happy New Year!;-) Su Yi
