Thanks Mike (and Todd), that clears things up. I was not aware that the zookeeper locks were held by a separate process (ZKFC).
-Eric

On Fri, Mar 14, 2014 at 4:24 PM, Mike Drob <[email protected]> wrote:

> Replies from Todd Lipcon in-line.
>
> Mike
>
> ---------- Forwarded message ----------
> From: Todd Lipcon <[email protected]>
> Date: Fri, Mar 14, 2014 at 4:14 PM
> Subject: Re: HA namenode questions
>
>
> I'm not on dev@accumulo upstream list anymore, but here's an answer. Feel
> free to forward onto the public list (I've known Eric for a while)
>
> > ---------- Forwarded message ----------
> > From: Eric Newton <[email protected]>
> > Date: Fri, Mar 14, 2014 at 3:18 PM
> > Subject: HA namenode questions
> > To: [email protected]
> >
> >
> > For those of you running HA NN on large clusters, I'm looking for some
> > advice.
> >
> > I was looking at an HA NN config today. Either by default, or by following
> > the configuration instructions, I saw that the zookeeper timeout was set to
> > 5 seconds.
> >
> > * is this a reasonable timeout?
> >
>
> Yes -- this timeout is only used from the ZKFC process, which is a very
> lightweight process whose _only_ jobs are to (a) ping ZK, and (b) ping the
> NN to check its health. It has on the order of a few MB of heap usage, so
> should never GC. If it goes away longer than 5 seconds something is almost
> certainly wrong with the machine or network.
>
> That said, if you would rather ride out a longer network blip (eg a switch
> reboot) you could choose to make it longer.
>
> > * do you provide HA NN its own set of zookeepers?
> >
>
> So long as the ZKs aren't ridiculously overloaded, sharing should be fine.
> If you have a lot of un-tamed clients to some other ZK cluster, it's
> probably best from an isolation perspective to run your own ensemble for HA
> purposes. But, the ZK daemons could be colocated on the NNs + JT for
> example so long as they get dedicated spindles.
>
> > We have seen problems with large GC pauses with tablet servers. This
> > happens less and less as we have learned more tricks, but I'm constantly
> > talking to users who want their zookeeper timeout as high as two minutes.
> >
>
> Yea, the ZKFC has no heap usage, so no GC.
>
> > We have also had to increase the number of zookeepers on our largest
> > clusters in order to handle the "thundering herd" load when large
> > map/reduce jobs kick off and they all start talking to accumulo, which
> > requires reading information from zookeeper.
> >
>
> Clients today in HDFS HA don't ever talk to ZK, so the number of nodes
> accessing ZK is limited to just the two NNs.
>
> > Any experience you can share about HA NN configuration at scales over few
> > hundred nodes would be appreciated.
> >
>
> The ZK interaction should have no dependence on cluster size. The timeout
> for how long it is expected to become active can have a dependence on
> number of blocks in the cluster, but you should be able to see that by
> doing some "practice failovers". We're working on making the
> transitionToActive process quicker and more constant-time rather than
> dependent on initializing block replication queues inline with the
> failover.
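
For reference, the 5-second ZooKeeper timeout discussed in the thread appears to correspond to the ZKFC session timeout. A minimal sketch of where that setting would live, assuming the Hadoop 2.x property name ha.zookeeper.session-timeout.ms in core-site.xml; the 5000 value simply mirrors the figure quoted in the thread and is illustrative, so check core-default.xml for your Hadoop version before relying on it:

```xml
<!-- core-site.xml fragment (sketch; property name and value are assumptions
     based on Hadoop 2.x defaults, not taken from the thread itself) -->
<property>
  <name>ha.zookeeper.session-timeout.ms</name>
  <!-- Session timeout used by the ZKFC, in milliseconds (5000 = 5 seconds).
       Raise it only if you want to ride out longer network blips. -->
  <value>5000</value>
</property>
```

Raising the value trades slower detection of a genuinely failed NameNode host for tolerance of longer network interruptions, which is the trade-off Todd describes.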
