Hi Timothy,
There are 4 nodes in total: a, b, c, d
Resource Manager: a
Node Managers: a, b, c, d
Kafka and ZooKeeper running on: a
The YARN configuration is:
<property>
  <description>How long to wait until a node manager is considered
  dead.</description>
  <name>yarn.nm.liveness-monitor.expiry-interval-ms</name>
  <value>1000</value>
</property>
<property>
  <description>How often to check that node managers are still
  alive.</description>
  <name>yarn.resourcemanager.nm.liveness-monitor.interval-ms</name>
  <value>100</value>
</property>
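
For comparison: if I read yarn-default.xml (Hadoop 2.x) correctly, the Node
Manager heart-beat interval itself defaults to 1 s, so the 1000 ms expiry
above leaves no slack beyond a single heart-beat. A sketch of the presumed
default (please double-check against your Hadoop version):

<property>
  <description>The heart-beat interval in milliseconds for every
  NodeManager in the cluster (presumed default).</description>
  <name>yarn.resourcemanager.nodemanagers.heartbeat-interval-ms</name>
  <value>1000</value>
</property>
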
From the Samza web UI, I found that node 'a' appeared and disappeared again
and again in the node list.
Su Yi
On 2015-01-01 02:54:48, "Timothy Chen" <[email protected]> wrote:
>Hi Su Yi,
>
>Can you elaborate a bit more on what you mean by an unstable cluster
>when you configured the heartbeat interval to be 1 s?
>
>Tim
>
>On Wed, Dec 31, 2014 at 10:30 AM, Su Yi <[email protected]> wrote:
>> Hello,
>>
>> Here are some thoughts on HA (high availability) in Samza.
>>
>> 1. Failure detection
>>
>> The problem is that container failure detection in Samza depends entirely on
>> YARN. YARN relies on the Node Manager to report container failures, but the
>> Node Manager can fail too (e.g., if the machine fails, the NM fails with it).
>> Node Manager failures can be detected through heartbeats by the Resource
>> Manager, but by default it takes 10 minutes to confirm a Node Manager
>> failure. I think that is acceptable for batch processing, but not for stream
>> processing.
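>>
>> (If I read yarn-default.xml correctly, that 10-minute default corresponds to
>> the following property; a sketch of the presumed default, worth
>> double-checking against your Hadoop version:)
>>
>> <property>
>>   <description>How long to wait until a node manager is considered
>>   dead (presumed default: 10 minutes).</description>
>>   <name>yarn.nm.liveness-monitor.expiry-interval-ms</name>
>>   <value>600000</value>
>> </property>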
>>
>> Configuring the YARN failure-confirmation interval to 1 s results in an
>> unstable YARN cluster (4 nodes in total). With 2 s, everything works fine,
>> but it takes 10 s to 20 s to get a lost container (machine shut down) back.
>> Considering that the test stream task is very simple (stateless), the
>> recovery time is relatively long.
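>>
>> (For concreteness, by "2 s" I mean a yarn-site.xml sketch like the following;
>> the 200 ms check interval is my guess at a proportionate value:)
>>
>> <property>
>>   <name>yarn.nm.liveness-monitor.expiry-interval-ms</name>
>>   <value>2000</value>
>> </property>
>> <property>
>>   <name>yarn.resourcemanager.nm.liveness-monitor.interval-ms</name>
>>   <value>200</value>
>> </property>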
>>
>> I am not an expert on YARN, so I don't know why it takes such a long time to
>> confirm node failure by default. To my understanding, YARN tries to be
>> general-purpose, and that is not sufficient for a stream-processing
>> framework. Extra effort beyond YARN is needed for failure detection in
>> stream processing.
>>
>> 2. Task redeployment
>>
>> After the Resource Manager informs Samza of a container failure, Samza has to
>> request resources from YARN to redeploy the failed tasks, which consumes time
>> during recovery. And recovery time is critical for HA in stream processing.
>> I think maintaining a few standby containers could eliminate this overhead:
>> Samza could deploy failed tasks onto the standby containers rather than
>> requesting new ones from YARN.
>>
>> Hot standby containers, described in SAMZA-406
>> (https://issues.apache.org/jira/browse/SAMZA-406), may also help reduce
>> recovery time, but they are costly (they double the resources needed).
>>
>> I'm wondering what these ideas mean to you, and how feasible they would be.
>> By the way, I'm using Samza 0.7.
>>
>> Thank you for reading.
>>
>> Happy New Year! ;-)
>>
>> Su Yi