Hi Timothy,

There are 4 nodes in total: a, b, c, d
Resource manager: a
Node manager: a, b, c, d
Kafka and ZooKeeper running on: a

The YARN configuration is:

<property>
    <description>How long to wait until a node manager is considered dead.</description>
    <name>yarn.nm.liveness-monitor.expiry-interval-ms</name>
    <value>1000</value>
</property>

<property>
    <description>How often to check that node managers are still alive.</description>
    <name>yarn.resourcemanager.nm.liveness-monitor.interval-ms</name>
    <value>100</value>
</property>

In the Samza web UI, I saw node 'a' appear and disappear again and again in
the node list.
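
One possible explanation for node 'a' appearing and disappearing, as a rough
sketch only: I am assuming that Node Managers send a heartbeat roughly once per
second (I believe that is the default of
yarn.resourcemanager.nodemanagers.heartbeat-interval-ms), so an expiry interval
of 1000 ms leaves no slack for a heartbeat that arrives slightly late. The toy
model below is not YARN's code; count_expiries and the jitter values are
hypothetical and only illustrate the timing.

```python
# Toy model of a liveness monitor. This is NOT YARN's implementation;
# the function name and all numbers are assumptions for illustration.

def count_expiries(heartbeat_times_ms, expiry_ms, check_ms, end_ms):
    """Count how often the monitor would declare the node dead."""
    expiries = 0
    alive = True
    last_seen = 0
    pending = sorted(heartbeat_times_ms)
    i = 0
    for now in range(0, end_ms + 1, check_ms):
        # Consume every heartbeat that has arrived by `now`.
        while i < len(pending) and pending[i] <= now:
            last_seen = pending[i]
            alive = True  # a heartbeat makes the node reappear
            i += 1
        # Expire the node if it has been silent for longer than expiry_ms.
        if alive and now - last_seen > expiry_ms:
            alive = False
            expiries += 1
    return expiries


# Assumed: heartbeats nominally every 1000 ms, some delayed by jitter (ms).
jitter = [0, 150, 0, 200, 0, 120, 0, 0, 180, 0]
beats = [1000 * (k + 1) + j for k, j in enumerate(jitter)]

flaps_1s = count_expiries(beats, expiry_ms=1000, check_ms=100, end_ms=10000)
flaps_2s = count_expiries(beats, expiry_ms=2000, check_ms=100, end_ms=10000)
print("expiry 1000 ms:", flaps_1s, "expiries")
print("expiry 2000 ms:", flaps_2s, "expiries")
```

In this model every delayed heartbeat causes an expiry with the 1000 ms
setting, while the 2000 ms setting tolerates the same delays. That would be
consistent with what I saw, but I have not verified it against the YARN source.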

Su Yi

On 2015-01-01 02:54:48,"Timothy Chen" <[email protected]> wrote:

>Hi Su Yi,
>
>Can you elaborate a bit more on what you mean by an unstable cluster when
>you configured the heartbeat interval to be 1s?
>
>Tim
>
>On Wed, Dec 31, 2014 at 10:30 AM, Su Yi <[email protected]> wrote:
>> Hello,
>>
>> Here are some thoughts about HA of Samza.
>>
>> 1. Failure detection
>>
>> The problem is that container failure detection in Samza depends completely
>> on YARN. YARN relies on the Node Manager to report container failures, but
>> the Node Manager can fail too (for example, if the machine fails, the NM
>> fails with it). The Resource Manager can detect Node Manager failures through
>> heartbeats, but by default it takes 10 minutes to confirm a Node Manager
>> failure. I think that's OK for batch processing, but not for stream
>> processing.
>>
>> Configuring the YARN failure confirmation interval to 1s results in an
>> unstable YARN cluster (4 nodes in total). With 2s, everything works fine,
>> but it takes 10s~20s to get a lost container (machine shut down) back.
>> Considering that the test stream task is very simple (stateless), the
>> recovery time is relatively long.
>>
>> I am not an expert on YARN, and I don't know why it takes such a long time
>> by default to confirm a node failure. To my understanding, YARN tries to be
>> general-purpose, and that is not sufficient for a stream processing
>> framework. Extra effort beyond YARN should go into failure detection for
>> stream processing.
>>
>> 2. Task redeployment
>>
>> After the Resource Manager informs Samza of a container failure, Samza has
>> to request resources from YARN to redeploy the failed tasks, which takes
>> time during recovery. And recovery time is critical for HA in stream
>> processing. I think maintaining a few standby containers could eliminate
>> this overhead. Samza could deploy failed tasks on the standby containers
>> rather than requesting new ones from YARN.
>>
>> Hot standby containers, which are described in
>> SAMZA-406 (https://issues.apache.org/jira/browse/SAMZA-406), may help reduce
>> recovery time, but they are costly (they double the resources needed).
>>
>> I'm wondering what you think of these ideas and how feasible they are. By
>> the way, I'm using Samza 0.7.
>>
>> Thank you for reading.
>>
>> Happy New Year!;-)
>>
>> Su Yi
