Fwd: HA namenode questions

Mike Drob Fri, 14 Mar 2014 13:25:06 -0700

Replies from Todd Lipcon in-line.

Mike


---------- Forwarded message ----------
From: Todd Lipcon <[email protected]>
Date: Fri, Mar 14, 2014 at 4:14 PM
Subject: Re: HA namenode questions


I'm not on dev@accumulo upstream list anymore, but here's an answer. Feel
free to forward on to the public list (I've known Eric for a while).


---------- Forwarded message ----------
> From: Eric Newton <[email protected]>
> Date: Fri, Mar 14, 2014 at 3:18 PM
> Subject: HA namenode questions
> To: [email protected]
>
>
> For those of you running HA NN on large clusters, I'm looking for some
> advice.
>
> I was looking at an HA NN config today.  Either by default, or by following
> the configuration instructions, I saw that the zookeeper timeout was set to
> 5 seconds.
>
> * is this a reasonable timeout?
>
>
Yes -- this timeout is only used from the ZKFC process, which is a very
lightweight process whose _only_ jobs are to (a) ping ZK, and (b) ping the
NN to check its health. It has on the order of a few MB of heap usage, so
should never GC. If it goes away longer than 5 seconds something is almost
certainly wrong with the machine or network.

That said, if you would rather ride out a longer network blip (e.g. a switch
reboot), you could choose to make it longer.
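
As a rough sketch of what lengthening it involves: this assumes the ZKFC
reads its session timeout from the core-site.xml property
ha.zookeeper.session-timeout.ms (check core-default.xml for your release).
The helper names below are hypothetical. The second function reflects that
a ZooKeeper server clamps requested session timeouts between
minSessionTimeout and maxSessionTimeout, which default to 2x and 20x
tickTime.

```python
import xml.etree.ElementTree as ET

# Assumed property name (from Hadoop's core-default.xml); verify for your release.
ZKFC_TIMEOUT_KEY = "ha.zookeeper.session-timeout.ms"


def zkfc_timeout_property(seconds):
    """Render a core-site.xml <property> stanza with the timeout in milliseconds."""
    if seconds <= 0:
        raise ValueError("timeout must be positive")
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = ZKFC_TIMEOUT_KEY
    ET.SubElement(prop, "value").text = str(int(round(seconds * 1000)))
    return ET.tostring(prop, encoding="unicode")


def negotiated_timeout_ms(requested_ms, tick_time_ms=2000,
                          min_ticks=2, max_ticks=20):
    """Clamp a requested session timeout the way a ZK server using the
    default minSessionTimeout/maxSessionTimeout (2 and 20 ticks) would."""
    low = min_ticks * tick_time_ms
    high = max_ticks * tick_time_ms
    return max(low, min(requested_ms, high))
```

With a tickTime of 2000 ms, negotiated_timeout_ms(120000) comes back as
40000, so a two-minute request would only take effect if maxSessionTimeout
were raised on the ZK servers as well.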


>  * do you provide HA NN its own set of zookeepers?
>
>
So long as the ZKs aren't ridiculously overloaded, sharing should be fine.
If you have a lot of un-tamed clients to some other ZK cluster, it's
probably best from an isolation perspective to run your own ensemble for HA
purposes. But, the ZK daemons could be colocated on the NNs + JT for
example so long as they get dedicated spindles.


>  We have seen problems with large GC pauses with tablet servers.  This
> happens less and less as we have learned more tricks, but I'm constantly
> talking to users who want their zookeeper timeout as high as two minutes.
>
>
Yea, the ZKFC has no heap usage, so no GC.


>  We have also had to increase the number of zookeepers on our largest
> clusters in order to handle the "thundering herd" load when large
> map/reduce jobs kick off and they all start talking to accumulo, which
> requires reading information from zookeeper.
>
>
Clients today in HDFS HA don't ever talk to ZK, so the number of nodes
accessing ZK is limited to just the two NNs.

>  Any experience you can share about HA NN configuration at scales over a
> few hundred nodes would be appreciated.
>
>
The ZK interaction should have no dependence on cluster size. The timeout
for how long the NN is expected to take to become active can depend on the
number of blocks in the cluster, but you should be able to see that by
doing some "practice failovers". We're working on making the
transitionToActive process quicker and more constant-time rather than
dependent on initializing block replication queues inline with the failover.
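
One way to time a practice failover is sketched below. It assumes the
stock "hdfs haadmin" CLI is on the PATH and that -getServiceState prints
"active" for the active NN. The function names and the NameNode IDs (nn1,
nn2) are placeholders, and the command runner, clock and sleep are
injectable so the logic can be exercised without a cluster.

```python
import subprocess
import time


def run_cmd(args):
    """Run a command and return its stripped stdout; raises on non-zero exit."""
    result = subprocess.run(args, check=True, capture_output=True, text=True)
    return result.stdout.strip()


def time_failover(from_nn, to_nn, run=run_cmd, clock=time.monotonic,
                  sleep=time.sleep, poll_s=0.5, limit_s=600.0):
    """Fail over from one NameNode ID to another and return the seconds
    elapsed until the target reports itself active."""
    start = clock()
    # Initiate the failover; the CLI may itself block for part of the transition.
    run(["hdfs", "haadmin", "-failover", from_nn, to_nn])
    # Poll the target's HA state until it is active or the limit is reached.
    while clock() - start < limit_s:
        if run(["hdfs", "haadmin", "-getServiceState", to_nn]) == "active":
            return clock() - start
        sleep(poll_s)
    raise TimeoutError(f"{to_nn} not active after {limit_s} seconds")
```

Calling time_failover("nn1", "nn2") a few times on a test cluster, and
again as the block count grows, gives a rough feel for how long the
transition takes. It only measures until the target reports active, not
anything clients observe afterwards.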
