Hi Su Yi,

Can you elaborate a bit more what you mean by unstable cluster when
you configured the heartbeat interval to be 1s?

Tim

On Wed, Dec 31, 2014 at 10:30 AM, Su Yi <[email protected]> wrote:
> Hello,
>
> Here are some thoughts about HA of Samza.
>
> 1. Failure detection
>
> The problem is, failure detection of container completely depends on YARN in 
> Samza. YARN counts on Node Manager reporting container failures, however Node 
> Manager could fail, too (like, if the machine failed, NM would fail). Node 
> Manager failures can be detected through heartbeat by Resource Manager, but, 
> by default it'll take 10 mins to confirm Node Manager failure. I think, 
> that's OK with batch processing, but not stream processing.
>
> Configuring yarn failure confirm interval to 1s, result in an unstable yarn 
> cluster(4 node in total). With 2s, all things works fine, but it takes 
> 10s~20s to get lost container(machine shut down) back. Considering that 
> testing stream task is very simple(stateless), the recovery time is 
> relatively long.
>
> I am not an expert on YARN, I don't know why it, by default, takes such a 
> long time to confirm node failure. To my understanding, YARN is something 
> trying to be general, and it is not sufficient for stream processing 
> framework. Extra effort should be done beyond YARN on failure detection in 
> stream processing.
>
> 2. Task redeployment
>
> After Resource Manager informed Samza of container failure, Samza should 
> apply for resources from YARN to redeploy failed tasks, which consumes time 
> during recovery. And, recovery time is critical for HA in stream processing. 
> I think, maintaining a few standby containers may eliminate this overhead on 
> recovery time. Samza could deploy failed tasks on the standby containers than 
> requesting from YARN.
>
> Hot standby containers, which is described in 
> SAMZA-406(https://issues.apache.org/jira/browse/SAMZA-406), may help save 
> recovery time, however it's costly(it doubles the resources needed).
>
> I'm wondering, what does these stuffs means to you, and how about the 
> feasibility. By the way, I'm using Samza 0.7 .
>
> Thank you for reading.
>
> Happy New Year!;-)
>
> Su Yi

Reply via email to