Hello,

Here are some thoughts about HA of Samza.

1. Failure detection

The problem is, failure detection of container completely depends on YARN in 
Samza. YARN counts on Node Manager reporting container failures, however Node 
Manager could fail, too (like, if the machine failed, NM would fail). Node 
Manager failures can be detected through heartbeat by Resource Manager, but, by 
default it'll take 10 mins to confirm Node Manager failure. I think, that's OK 
with batch processing, but not stream processing.

Configuring yarn failure confirm interval to 1s, result in an unstable yarn 
cluster(4 node in total). With 2s, all things works fine, but it takes 10s~20s 
to get lost container(machine shut down) back. Considering that testing stream 
task is very simple(stateless), the recovery time is relatively long.

I am not an expert on YARN, I don't know why it, by default, takes such a long 
time to confirm node failure. To my understanding, YARN is something trying to 
be general, and it is not sufficient for stream processing framework. Extra 
effort should be done beyond YARN on failure detection in stream processing.

2. Task redeployment

After Resource Manager informed Samza of container failure, Samza should apply 
for resources from YARN to redeploy failed tasks, which consumes time during 
recovery. And, recovery time is critical for HA in stream processing. I think, 
maintaining a few standby containers may eliminate this overhead on recovery 
time. Samza could deploy failed tasks on the standby containers than requesting 
from YARN.

Hot standby containers, which is described in 
SAMZA-406(https://issues.apache.org/jira/browse/SAMZA-406), may help save 
recovery time, however it's costly(it doubles the resources needed).

I'm wondering, what does these stuffs means to you, and how about the 
feasibility. By the way, I'm using Samza 0.7 .

Thank you for reading.

Happy New Year!;-)

Su Yi

Reply via email to