Replies from Todd Lipcon in-line.

Mike
---------- Forwarded message ----------
From: Todd Lipcon <[email protected]>
Date: Fri, Mar 14, 2014 at 4:14 PM
Subject: Re: HA namenode questions

I'm not on dev@accumulo upstream list anymore, but here's an answer. Feel free to forward onto the public list (I've known Eric for a while)

---------- Forwarded message ----------
> From: Eric Newton <[email protected]>
> Date: Fri, Mar 14, 2014 at 3:18 PM
> Subject: HA namenode questions
> To: [email protected]
>
>
> For those of you running HA NN on large clusters, I'm looking for some
> advice.
>
> I was looking at an HA NN config today. Either by default, or by following
> the configuration instructions, I saw that the zookeeper timeout was set to
> 5 seconds.
>
> * is this a reasonable timeout?
>

Yes -- this timeout is only used from the ZKFC process, which is a very lightweight process whose _only_ jobs are to (a) ping ZK, and (b) ping the NN to check its health. It has on the order of a few MB of heap usage, so should never GC. If it goes away longer than 5 seconds something is almost certainly wrong with the machine or network. That said, if you would rather ride out a longer network blip (eg a switch reboot) you could choose to make it longer.

> * do you provide HA NN its own set of zookeepers?
>

So long as the ZKs aren't ridiculously overloaded, sharing should be fine. If you have a lot of un-tamed clients to some other ZK cluster, it's probably best from an isolation perspective to run your own ensemble for HA purposes. But, the ZK daemons could be colocated on the NNs + JT for example so long as they get dedicated spindles.

> We have seen problems with large GC pauses with tablet servers. This
> happens less and less as we have learned more tricks, but I'm constantly
> talking to users who want their zookeeper timeout as high as two minutes.
>

Yea, the ZKFC has no heap usage, so no GC.

> We have also had to increase the number of zookeepers on our largest
> clusters in order to handle the "thundering herd" load when large
> map/reduce jobs kick off and they all start talking to accumulo, which
> requires reading information from zookeeper.
>

Clients today in HDFS HA don't ever talk to ZK, so the number of nodes accessing ZK is limited to just the two NNs.

> Any experience you can share about HA NN configuration at scales over few
> hundred nodes would be appreciated.
>

The ZK interaction should have no dependence on cluster size. The timeout for how long it is expected to become active can have a dependence on number of blocks in the cluster, but you should be able to see that by doing some "practice failovers". We're working on making the transitionToActive process quicker and more constant-time rather than dependent on initializing block replication queues inline with the failover.
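
One way to run the "practice failovers" mentioned above and see how long the transition takes on a given cluster is sketched below. This is only an illustration, not anything from the thread: it assumes the `hdfs` command is on the PATH, that the two NameNodes have the IDs `nn1` and `nn2` (hypothetical names; use whatever IDs your HA configuration defines), and that `hdfs haadmin -getServiceState` prints `active` or `standby`. The helper names `run_hdfs` and `time_practice_failover` are made up for this sketch.

```python
import subprocess
import time


def run_hdfs(args):
    """Run an `hdfs` subcommand and return its stdout with whitespace stripped."""
    result = subprocess.run(
        ["hdfs"] + list(args), capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


def time_practice_failover(from_nn, to_nn, run=run_hdfs, poll_seconds=1.0,
                           timeout_seconds=600.0, clock=time.monotonic,
                           sleep=time.sleep):
    """Fail over from from_nn to to_nn; return seconds until to_nn reports active.

    from_nn and to_nn are NameNode IDs (hypothetical values such as "nn1", "nn2").
    The timer starts before the failover command, so the result includes the time
    the command itself takes to return.
    """
    start = clock()
    run(["haadmin", "-failover", from_nn, to_nn])
    while clock() - start < timeout_seconds:
        # Poll the target NameNode until it reports itself as active.
        if run(["haadmin", "-getServiceState", to_nn]) == "active":
            return clock() - start
        sleep(poll_seconds)
    raise TimeoutError(f"{to_nn} did not become active within {timeout_seconds}s")


if __name__ == "__main__":
    # Assumed NameNode IDs; replace with the IDs from your own configuration.
    elapsed = time_practice_failover("nn1", "nn2")
    print(f"nn2 became active after {elapsed:.1f}s")
```

Repeating this a few times on a test cluster, and failing back in between, should give a rough idea of how the transition time varies with the number of blocks. The command runner, clock, and sleep function are parameters only so the helper can be exercised without a live cluster.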
