Hi Su Yi, Can you elaborate a bit more what you mean by unstable cluster when you configured the heartbeat interval to be 1s?
Tim On Wed, Dec 31, 2014 at 10:30 AM, Su Yi <[email protected]> wrote: > Hello, > > Here are some thoughts about HA of Samza. > > 1. Failure detection > > The problem is, failure detection of container completely depends on YARN in > Samza. YARN counts on Node Manager reporting container failures, however Node > Manager could fail, too (like, if the machine failed, NM would fail). Node > Manager failures can be detected through heartbeat by Resource Manager, but, > by default it'll take 10 mins to confirm Node Manager failure. I think, > that's OK with batch processing, but not stream processing. > > Configuring yarn failure confirm interval to 1s, result in an unstable yarn > cluster(4 node in total). With 2s, all things works fine, but it takes > 10s~20s to get lost container(machine shut down) back. Considering that > testing stream task is very simple(stateless), the recovery time is > relatively long. > > I am not an expert on YARN, I don't know why it, by default, takes such a > long time to confirm node failure. To my understanding, YARN is something > trying to be general, and it is not sufficient for stream processing > framework. Extra effort should be done beyond YARN on failure detection in > stream processing. > > 2. Task redeployment > > After Resource Manager informed Samza of container failure, Samza should > apply for resources from YARN to redeploy failed tasks, which consumes time > during recovery. And, recovery time is critical for HA in stream processing. > I think, maintaining a few standby containers may eliminate this overhead on > recovery time. Samza could deploy failed tasks on the standby containers than > requesting from YARN. > > Hot standby containers, which is described in > SAMZA-406(https://issues.apache.org/jira/browse/SAMZA-406), may help save > recovery time, however it's costly(it doubles the resources needed). > > I'm wondering, what does these stuffs means to you, and how about the > feasibility. By the way, I'm using Samza 0.7 . > > Thank you for reading. > > Happy New Year!;-) > > Su Yi
