Hi Fang,

I have verified the failure detection issue: it takes about 10 minutes to recover if I kill the Node Manager process first. I will detail the experiment below, in case I have made any mistakes.

The node arrangement is the same as before.

Workload :
Every second, a Python program generates a record containing the current system time and sends it 10 times to the Kafka topic "suyi-test-input".
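
For reference, the producer logic is roughly the sketch below. The real script is Python; this Java version, using the standard Kafka producer client, is only meant to make the behaviour concrete, and the class name, broker address, and time formatting are my assumptions:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    // Illustrative only: once per second, build a record carrying the current
    // system time and send it 10 times to the Kafka topic "suyi-test-input".
    public class TimeRecordProducer {
        public static void main(String[] args) throws InterruptedException {
            Properties props = new Properties();
            props.put("bootstrap.servers", "a:9092"); // Kafka runs on node "a"; port assumed
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                while (true) {
                    String record = "{\"time\":\"" + (System.currentTimeMillis() / 1000) + ".0\"}";
                    for (int i = 0; i < 10; i++) {
                        producer.send(new ProducerRecord<>("suyi-test-input", record));
                    }
                    Thread.sleep(1000);
                }
            }
        }
    }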
        
Stream Task :
It reads the tuples from the input stream and forwards them to the output stream "suyi-test-output". Checkpointing is disabled.
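
The task itself is essentially a pass-through. A minimal sketch of what it does is below (the class name is mine; the input topic is wired up in the job properties, e.g. task.inputs=kafka.suyi-test-input, and checkpointing is disabled simply by not setting task.checkpoint.factory):

    import org.apache.samza.system.IncomingMessageEnvelope;
    import org.apache.samza.system.OutgoingMessageEnvelope;
    import org.apache.samza.system.SystemStream;
    import org.apache.samza.task.MessageCollector;
    import org.apache.samza.task.StreamTask;
    import org.apache.samza.task.TaskCoordinator;

    // Forwards every tuple from the input stream unchanged to the Kafka
    // topic "suyi-test-output".
    public class PassThroughTask implements StreamTask {
        private static final SystemStream OUTPUT = new SystemStream("kafka", "suyi-test-output");

        @Override
        public void process(IncomingMessageEnvelope envelope,
                            MessageCollector collector,
                            TaskCoordinator coordinator) {
            collector.send(new OutgoingMessageEnvelope(OUTPUT, envelope.getMessage()));
        }
    }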

YARN Configuration :
        <property>
                <description>How long to wait until a node manager is considered dead.</description>
                <name>yarn.nm.liveness-monitor.expiry-interval-ms</name>
                <value>600000</value>
        </property>
        <property>
                <description>How often to check that node managers are still alive.</description>
                <name>yarn.resourcemanager.nm.liveness-monitor.interval-ms</name>
                <value>1000</value>
        </property>

Experiment Process :
(1) I submitted the stream task; the application master ran on node "a", and the other container ran on node "b".
(2) I killed the Node Manager process on node "b".
(3) At about 09:59:07, I killed the container process on node "b".
        
Results :
(1) At 10:09:33, the application master tried to redeploy the lost container, according to the application master logs.
(2) The output between 09:59:35 and 10:09:53 is lost.
This roughly 10-minute delay matches the configured yarn.nm.liveness-monitor.expiry-interval-ms of 600000 ms.
        
An excerpt from the application master logs:
2015-01-04 09:47:20 ContainerManagementProtocolProxy [INFO] Opening proxy : b:35889
2015-01-04 09:47:20 SamzaAppMasterTaskManager [INFO] Claimed task ID 0 for container container_1420335573203_0001_01_000002 on node b (http://b:8042/node/containerlogs/container_1420335573203_0001_01_000002).
2015-01-04 09:47:20 SamzaAppMasterTaskManager [INFO] Started task ID 0
2015-01-04 09:57:19 ClientUtils$ [INFO] Fetching metadata from broker id:0,host:192.168.3.141,port:9092 with correlation id 41 for 1 topic(s) Set(metrics)
2015-01-04 09:57:19 SyncProducer [INFO] Connected to 192.168.3.141:9092 for producing
2015-01-04 09:57:19 SyncProducer [INFO] Disconnecting from 192.168.3.141:9092
2015-01-04 09:57:19 SyncProducer [INFO] Disconnecting from a:9092
2015-01-04 09:57:19 SyncProducer [INFO] Connected to a:9092 for producing
2015-01-04 10:07:19 ClientUtils$ [INFO] Fetching metadata from broker id:0,host:192.168.3.141,port:9092 with correlation id 82 for 1 topic(s) Set(metrics)
2015-01-04 10:07:19 SyncProducer [INFO] Connected to 192.168.3.141:9092 for producing
2015-01-04 10:07:19 SyncProducer [INFO] Disconnecting from 192.168.3.141:9092
2015-01-04 10:07:19 SyncProducer [INFO] Disconnecting from a:9092
2015-01-04 10:07:19 SyncProducer [INFO] Connected to a:9092 for producing
2015-01-04 10:09:33 SamzaAppMasterTaskManager [INFO] Got an exit code of -100. This means that container container_1420335573203_0001_01_000002 was killed by YARN, either due to being released by the application master or being 'lost' due to node failures etc.
2015-01-04 10:09:33 SamzaAppMasterTaskManager [INFO] Released container container_1420335573203_0001_01_000002 was assigned task ID 0. Requesting a new container for the task.
2015-01-04 10:09:33 SamzaAppMasterTaskManager [INFO] Requesting 1 container(s) with 1024mb of memory
        
An excerpt from the output, with my comments added:
        {"time":"1420336774.0"}
        {"time":"1420336774.0"}
        {"time":"1420336774.0"}
        {"time":"1420336775.0"}
        {"time":"1420336775.0"}
        {"time":"1420336775.0"}
        {"time":"1420336775.0"}
        {"time":"1420336775.0"}
        {"time":"1420336775.0"}
        {"time":"1420336775.0"}
        {"time":"1420336775.0"}
        {"time":"1420336775.0"}
        {"time":"1420336775.0"} \\ 09:59:35
        {"time":"1420337393.0"} \\ 10:09:53
        {"time":"1420337393.0"}
        {"time":"1420337393.0"}
        {"time":"1420337393.0"}
        {"time":"1420337393.0"}
        {"time":"1420337393.0"}
        {"time":"1420337393.0"}
        {"time":"1420337393.0"}
        {"time":"1420337393.0"}
        {"time":"1420337393.0"}
        {"time":"1420337394.0"}
        {"time":"1420337394.0"}
        {"time":"1420337394.0"}
        {"time":"1420337394.0"}
        {"time":"1420337394.0"}
        
As for the task redeployment issue, I worry that if the Resource Manager is busy or there are no available containers in the system, redeployment of the failed task might be delayed.

Thank you for your help.

Su Yi

On Sat, 03 Jan 2015 06:04:20 +0800, Yan Fang <[email protected]> wrote:

Hi Su Yi,

I think there may be a misunderstanding. For the failure detection, if the
containers die (because of NM failure or whatever reason), the AM will bring
up new containers on the same NM or a different NM according to the
resource availability. It does not take as much as 10 mins to recover. One
way you can test is to run a Samza job and manually kill the NM or
the thread to see how quickly it recovers. In terms of how
yarn.nm.liveness-monitor.expiry-interval-ms plays a role here, I am not
very sure. Hopefully a YARN expert in the community can explain it a little.

The goal of the standby container in SAMZA-406 is to recover quickly when the
task has a lot of local state and reading the changelog therefore takes a long
time, not to reduce the time of *allocating* the container, which, I believe,
is taken care of by YARN.

Hope this helps a little. Thanks.

Cheers,

Fang, Yan
[email protected]
+1 (206) 849-4108

On Thu, Jan 1, 2015 at 4:20 AM, Su Yi <[email protected]> wrote:

Hi Timothy,

There are 4 nodes in total : a,b,c,d
Resource manager : a
Node manager : a,b,c,d
Kafka and zookeeper running on : a

YARN configuration is :

<property>
    <description>How long to wait until a node manager is considered
dead.</description>
    <name>yarn.nm.liveness-monitor.expiry-interval-ms</name>
    <value>1000</value>
</property>

<property>
    <description>How often to check that node managers are still
alive.</description>
    <name>yarn.resourcemanager.nm.liveness-monitor.interval-ms</name>
    <value>100</value>
</property>

From the Samza web UI, I found that node 'a' appeared and disappeared again
and again in the node list.

Su Yi

On 2015-01-01 02:54:48, "Timothy Chen" <[email protected]> wrote:

>Hi Su Yi,
>
>Can you elaborate a bit more what you mean by unstable cluster when
>you configured the heartbeat interval to be 1s?
>
>Tim
>
>On Wed, Dec 31, 2014 at 10:30 AM, Su Yi <[email protected]> wrote:
>> Hello,
>>
>> Here are some thoughts about HA of Samza.
>>
>> 1. Failure detection
>>
>> The problem is that, in Samza, failure detection of containers depends
>> completely on YARN. YARN counts on the Node Manager to report container
>> failures, but the Node Manager can fail too (for example, if the machine
>> fails, the NM fails with it). Node Manager failures can be detected through
>> heartbeats by the Resource Manager, but by default it takes 10 mins to
>> confirm a Node Manager failure. I think that is OK for batch processing,
>> but not for stream processing.
>>
>> Configuring the YARN failure confirmation interval to 1s resulted in an
>> unstable YARN cluster (4 nodes in total). With 2s, everything works fine,
>> but it takes 10s~20s to get a lost container (machine shut down) back.
>> Considering that the test stream task is very simple (stateless), the
>> recovery time is relatively long.
>>
>> I am not an expert on YARN, and I don't know why it takes such a long time
>> by default to confirm a node failure. To my understanding, YARN tries to be
>> general-purpose, and that is not sufficient for a stream processing
>> framework. Extra effort beyond YARN is needed for failure detection in
>> stream processing.
>>
>> 2. Task redeployment
>>
>> After the Resource Manager informs Samza of a container failure, Samza
>> has to apply for resources from YARN to redeploy the failed tasks, which
>> consumes time during recovery. And recovery time is critical for HA in
>> stream processing. I think maintaining a few standby containers may
>> eliminate this overhead on recovery time: Samza could deploy failed tasks
>> on the standby containers rather than requesting new ones from YARN.
>>
>> Hot standby containers, as described in SAMZA-406
>> (https://issues.apache.org/jira/browse/SAMZA-406), may help save recovery
>> time, but they are costly (they double the resources needed).
>>
>> I'm wondering what these ideas mean to you, and how feasible they seem. By
>> the way, I'm using Samza 0.7.
>>
>> Thank you for reading.
>>
>> Happy New Year!;-)
>>
>> Su Yi
