Hi Su Yi,

I think there may be a misunderstanding. For failure detection, if the
containers die (because of NM failure or any other reason), the AM will
bring up new containers on the same NM or a different NM, depending on
resource availability. It does not take as long as 10 minutes to recover.
One way to test this is to run a Samza job and manually kill the NM or the
container process to see how quickly it recovers. As for what role
yarn.nm.liveness-monitor.expiry-interval-ms
plays here, I am not very sure; I hope a YARN expert in the community can
explain it a little.
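
A minimal sketch of that kill-the-NM experiment, assuming a standard Hadoop
install with `jps` on the PATH (the exact setup varies by install):

```shell
# Rough sketch of the manual failure test described above. Assumes a
# standard Hadoop install with `jps` available; nothing here is from the
# original thread, it is only illustrative.

# jps lists running JVMs by their main class; grab the NodeManager's pid.
if command -v jps >/dev/null 2>&1; then
  NM_PID=$(jps | awk '/NodeManager/ {print $1}')
  # Kill it abruptly to simulate a node failure (skipped if no NM runs).
  if [ -n "$NM_PID" ]; then
    kill -9 "$NM_PID"
  fi
fi

# Then watch the Samza AM log and time how long it takes for a replacement
# container to be allocated and started on another NM.
```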

The goal of the standby containers in SAMZA-406 is to recover quickly when
a task has a lot of local state and reading the changelog therefore takes a
long time, not to reduce the time spent *allocating* the container, which,
I believe, is taken care of by YARN.
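
For context on why the changelog restore dominates recovery, a stateful
store is wired to a changelog stream in the job config, roughly like this
(store and stream names here are illustrative, not from this thread; the
factory class follows the Samza state-management docs):

```properties
# Hypothetical store config; "my-store" and the changelog stream name
# are made up for illustration.
stores.my-store.factory=org.apache.samza.storage.kv.KeyValueStorageEngineFactory
stores.my-store.key.serde=string
stores.my-store.msg.serde=string
# On a cold container restart, this whole changelog topic must be re-read
# to rebuild local state - that restore time is what SAMZA-406 targets.
stores.my-store.changelog=kafka.my-store-changelog
```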

Hope this helps a little. Thanks.

Cheers,

Fang, Yan
[email protected]
+1 (206) 849-4108

On Thu, Jan 1, 2015 at 4:20 AM, Su Yi <[email protected]> wrote:

> Hi Timothy,
>
> There are 4 nodes in total : a,b,c,d
> Resource manager : a
> Node manager : a,b,c,d
> Kafka and zookeeper running on : a
>
> YARN configuration is :
>
> <property>
>     <description>How long to wait until a node manager is considered
> dead.</description>
>     <name>yarn.nm.liveness-monitor.expiry-interval-ms</name>
>     <value>1000</value>
> </property>
>
> <property>
>     <description>How often to check that node managers are still
> alive.</description>
>     <name>yarn.resourcemanager.nm.liveness-monitor.interval-ms</name>
>     <value>100</value>
> </property>
>
> From the Samza web UI, I found that node 'a' repeatedly appeared and
> disappeared in the node list.
>
> Su Yi
>
> On 2015-01-01 02:54:48, "Timothy Chen" <[email protected]> wrote:
>
> >Hi Su Yi,
> >
> >Can you elaborate a bit more on what you mean by an unstable cluster
> >when you configured the heartbeat interval to be 1s?
> >
> >Tim
> >
> >On Wed, Dec 31, 2014 at 10:30 AM, Su Yi <[email protected]> wrote:
> >> Hello,
> >>
> >> Here are some thoughts about HA of Samza.
> >>
> >> 1. Failure detection
> >>
> >> The problem is that, in Samza, container failure detection depends
> >> entirely on YARN. YARN relies on the Node Manager to report container
> >> failures, but the Node Manager itself can fail too (e.g., if the
> >> machine fails, the NM fails with it). Node Manager failures can be
> >> detected by the Resource Manager through heartbeats, but by default it
> >> takes 10 minutes to confirm a Node Manager failure. I think that is
> >> acceptable for batch processing, but not for stream processing.
> >>
> >> Configuring the YARN failure-confirmation interval to 1s results in an
> >> unstable YARN cluster (4 nodes in total). With 2s, everything works
> >> fine, but it takes 10s~20s to get a lost container (machine shut down)
> >> back. Considering that the test stream task is very simple
> >> (stateless), the recovery time is relatively long.
> >>
> >> I am not an expert on YARN, so I don't know why, by default, it takes
> >> such a long time to confirm node failure. To my understanding, YARN
> >> tries to be general-purpose, and that is not sufficient for a stream
> >> processing framework. Extra effort beyond YARN is needed for failure
> >> detection in stream processing.
> >>
> >> 2. Task redeployment
> >>
> >> After the Resource Manager informs Samza of a container failure, Samza
> >> has to request resources from YARN to redeploy the failed tasks, which
> >> consumes time during recovery. Recovery time is critical for HA in
> >> stream processing. I think maintaining a few standby containers could
> >> eliminate this overhead: Samza could deploy failed tasks on the
> >> standby containers rather than requesting new ones from YARN.
> >>
> >> Hot standby containers, as described in SAMZA-406 (
> >> https://issues.apache.org/jira/browse/SAMZA-406), may help save
> >> recovery time; however, they are costly (they double the resources
> >> needed).
> >>
> >> I'm wondering what these ideas mean to you, and how feasible they
> >> are. By the way, I'm using Samza 0.7.
> >>
> >> Thank you for reading.
> >>
> >> Happy New Year!;-)
> >>
> >> Su Yi
>
