Hi Fang,

I have verified the failure detection issue: it takes about 10 minutes to recover if I kill the Node Manager process first. I will detail the experiment below, in case I have made any mistakes.

The node arrangement is the same as before.

Workload :
Every second, a Python program generates a record containing the current system time and sends it 10 times to the Kafka topic "suyi-test-input".
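
For reference, the producer logic is roughly the sketch below. The real script is Python; this Java version, using the standard Kafka producer client, is only meant to make the behaviour concrete, and the class name, broker address, and time formatting are my assumptions:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    // Illustrative only: once per second, build a record carrying the current
    // system time and send it 10 times to the Kafka topic "suyi-test-input".
    public class TimeRecordProducer {
        public static void main(String[] args) throws InterruptedException {
            Properties props = new Properties();
            props.put("bootstrap.servers", "a:9092"); // Kafka runs on node "a"; port assumed
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                while (true) {
                    String record = "{\"time\":\"" + (System.currentTimeMillis() / 1000) + ".0\"}";
                    for (int i = 0; i < 10; i++) {
                        producer.send(new ProducerRecord<>("suyi-test-input", record));
                    }
                    Thread.sleep(1000);
                }
            }
        }
    }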
        
Stream Task :
It reads the tuples from the input stream and forwards them to the output stream "suyi-test-output". Checkpointing is disabled.
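
The task itself is essentially a pass-through. A minimal sketch of what it does is below (the class name is mine; the input topic is wired up in the job properties, e.g. task.inputs=kafka.suyi-test-input, and checkpointing is disabled simply by not setting task.checkpoint.factory):

    import org.apache.samza.system.IncomingMessageEnvelope;
    import org.apache.samza.system.OutgoingMessageEnvelope;
    import org.apache.samza.system.SystemStream;
    import org.apache.samza.task.MessageCollector;
    import org.apache.samza.task.StreamTask;
    import org.apache.samza.task.TaskCoordinator;

    // Forwards every tuple from the input stream unchanged to the Kafka
    // topic "suyi-test-output".
    public class PassThroughTask implements StreamTask {
        private static final SystemStream OUTPUT = new SystemStream("kafka", "suyi-test-output");

        @Override
        public void process(IncomingMessageEnvelope envelope,
                            MessageCollector collector,
                            TaskCoordinator coordinator) {
            collector.send(new OutgoingMessageEnvelope(OUTPUT, envelope.getMessage()));
        }
    }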

YARN Configuration :
        <property>
                <description>How long to wait until a node manager is considered dead.</description>
                <name>yarn.nm.liveness-monitor.expiry-interval-ms</name>
                <value>600000</value>
        </property>
        <property>
                <description>How often to check that node managers are still alive.</description>
                <name>yarn.resourcemanager.nm.liveness-monitor.interval-ms</name>
                <value>1000</value>
        </property>

Experiment Process :
(1) I submitted the stream task; the application master ran on node "a", and the other container ran on node "b".
(2) I killed the Node Manager process on node "b".
(3) At about 09:59:07, I killed the container process on node "b".
        
Results :
(1) At 10:09:33, the application master tried to redeploy the lost container, according to the application master logs.
(2) The output between 09:59:35 and 10:09:53 is lost.
This roughly 10-minute delay matches the configured yarn.nm.liveness-monitor.expiry-interval-ms of 600000 ms.
        
An excerpt from the application master logs:
2015-01-04 09:47:20 ContainerManagementProtocolProxy [INFO] Opening proxy : b:35889
2015-01-04 09:47:20 SamzaAppMasterTaskManager [INFO] Claimed task ID 0 for container container_1420335573203_0001_01_000002 on node b (http://b:8042/node/containerlogs/container_1420335573203_0001_01_000002).
2015-01-04 09:47:20 SamzaAppMasterTaskManager [INFO] Started task ID 0
2015-01-04 09:57:19 ClientUtils$ [INFO] Fetching metadata from broker id:0,host:192.168.3.141,port:9092 with correlation id 41 for 1 topic(s) Set(metrics)
2015-01-04 09:57:19 SyncProducer [INFO] Connected to 192.168.3.141:9092 for producing
2015-01-04 09:57:19 SyncProducer [INFO] Disconnecting from 192.168.3.141:9092
2015-01-04 09:57:19 SyncProducer [INFO] Disconnecting from a:9092
2015-01-04 09:57:19 SyncProducer [INFO] Connected to a:9092 for producing
2015-01-04 10:07:19 ClientUtils$ [INFO] Fetching metadata from broker id:0,host:192.168.3.141,port:9092 with correlation id 82 for 1 topic(s) Set(metrics)
2015-01-04 10:07:19 SyncProducer [INFO] Connected to 192.168.3.141:9092 for producing
2015-01-04 10:07:19 SyncProducer [INFO] Disconnecting from 192.168.3.141:9092
2015-01-04 10:07:19 SyncProducer [INFO] Disconnecting from a:9092
2015-01-04 10:07:19 SyncProducer [INFO] Connected to a:9092 for producing
2015-01-04 10:09:33 SamzaAppMasterTaskManager [INFO] Got an exit code of -100. This means that container container_1420335573203_0001_01_000002 was killed by YARN, either due to being released by the application master or being 'lost' due to node failures etc.
2015-01-04 10:09:33 SamzaAppMasterTaskManager [INFO] Released container container_1420335573203_0001_01_000002 was assigned task ID 0. Requesting a new container for the task.
2015-01-04 10:09:33 SamzaAppMasterTaskManager [INFO] Requesting 1 container(s) with 1024mb of memory
        
An excerpt from the output, with my comments added:
        {"time":"1420336774.0"}
        {"time":"1420336774.0"}
        {"time":"1420336774.0"}
        {"time":"1420336775.0"}
        {"time":"1420336775.0"}
        {"time":"1420336775.0"}
        {"time":"1420336775.0"}
        {"time":"1420336775.0"}
        {"time":"1420336775.0"}
        {"time":"1420336775.0"}
        {"time":"1420336775.0"}
        {"time":"1420336775.0"}
        {"time":"1420336775.0"} \\ 09:59:35
        {"time":"1420337393.0"} \\ 10:09:53
        {"time":"1420337393.0"}
        {"time":"1420337393.0"}
        {"time":"1420337393.0"}
        {"time":"1420337393.0"}
        {"time":"1420337393.0"}
        {"time":"1420337393.0"}
        {"time":"1420337393.0"}
        {"time":"1420337393.0"}
        {"time":"1420337393.0"}
        {"time":"1420337394.0"}
        {"time":"1420337394.0"}
        {"time":"1420337394.0"}
        {"time":"1420337394.0"}
        {"time":"1420337394.0"}
        
As for the task redeployment issue, I worry that if the Resource Manager is busy or there are no available containers in the system, redeployment of the failed task might be delayed.

Thank you for your help.

Su Yi

On Sat, 03 Jan 2015 06:04:20 +0800, Yan Fang <[email protected]> wrote:

Hi Su Yi,

I think there may be a misunderstanding. For the failure detection, if the
containers die (because of NM failure or whatever reason), the AM will bring
up new containers on the same NM or a different NM according to the
resource availability. It does not take as much as 10 mins to recover. One
way you can test is to run a Samza job and manually kill the NM or
the thread to see how quickly it recovers. In terms of how
yarn.nm.liveness-monitor.expiry-interval-ms plays a role here, I am not
very sure. Hopefully a YARN expert in the community can explain it a little.

The goal of the standby container in SAMZA-406 is to recover quickly when the
task has a lot of local state and reading the changelog therefore takes a long
time, not to reduce the time of *allocating* the container, which, I believe,
is taken care of by YARN.

Hope this helps a little. Thanks.

Cheers,

Fang, Yan
[email protected]
+1 (206) 849-4108

On Thu, Jan 1, 2015 at 4:20 AM, Su Yi <[email protected]> wrote:

Hi Timothy,

There are 4 nodes in total : a,b,c,d
Resource manager : a
Node manager : a,b,c,d
Kafka and zookeeper running on : a

YARN configuration is :

<property>
    <description>How long to wait until a node manager is considered
dead.</description>
    <name>yarn.nm.liveness-monitor.expiry-interval-ms</name>
    <value>1000</value>
</property>

<property>
    <description>How often to check that node managers are still
alive.</description>
    <name>yarn.resourcemanager.nm.liveness-monitor.interval-ms</name>
    <value>100</value>
</property>

From the Samza web UI, I found that node 'a' appeared and disappeared again
and again in the node list.

Su Yi

On 2015-01-01 02:54:48, "Timothy Chen" <[email protected]> wrote:

>Hi Su Yi,
>
>Can you elaborate a bit more what you mean by unstable cluster when
>you configured the heartbeat interval to be 1s?
>
>Tim
>
>On Wed, Dec 31, 2014 at 10:30 AM, Su Yi <[email protected]> wrote:
>> Hello,
>>
>> Here are some thoughts about HA of Samza.
>>
>> 1. Failure detection
>>
>> The problem is that, in Samza, failure detection of containers depends
>> completely on YARN. YARN counts on the Node Manager to report container
>> failures, but the Node Manager can fail too (for example, if the machine
>> fails, the NM fails with it). Node Manager failures can be detected through
>> heartbeats by the Resource Manager, but by default it takes 10 mins to
>> confirm a Node Manager failure. I think that is OK for batch processing,
>> but not for stream processing.
>>
>> Configuring the YARN failure confirmation interval to 1s resulted in an
>> unstable YARN cluster (4 nodes in total). With 2s, everything works fine,
>> but it takes 10s~20s to get a lost container (machine shut down) back.
>> Considering that the test stream task is very simple (stateless), the
>> recovery time is relatively long.
>>
>> I am not an expert on YARN, and I don't know why it takes such a long time
>> by default to confirm a node failure. To my understanding, YARN tries to be
>> general-purpose, and that is not sufficient for a stream processing
>> framework. Extra effort beyond YARN is needed for failure detection in
>> stream processing.
>>
>> 2. Task redeployment
>>
>> After the Resource Manager informs Samza of a container failure, Samza
>> has to apply for resources from YARN to redeploy the failed tasks, which
>> consumes time during recovery. And recovery time is critical for HA in
>> stream processing. I think maintaining a few standby containers may
>> eliminate this overhead on recovery time: Samza could deploy failed tasks
>> on the standby containers rather than requesting new ones from YARN.
>>
>> Hot standby containers, as described in SAMZA-406
>> (https://issues.apache.org/jira/browse/SAMZA-406), may help save recovery
>> time, but they are costly (they double the resources needed).
>>
>> I'm wondering what these ideas mean to you, and how feasible they seem. By
>> the way, I'm using Samza 0.7.
>>
>> Thank you for reading.
>>
>> Happy New Year!;-)
>>
>> Su Yi
