I can confirm this is real. If a Flume master talking to external ZK machines (3 in my case) is left running idle for a long time, you will get this error when you come back to the flume shell to execute a command. The only fix seems to be to bounce the flume master (cdh3u3). That seems extreme compared with simply re-establishing an expired/timed-out ZK connection.
FLUME-60 appears to be this issue, but it is currently unassigned and hasn't been updated in some time so no idea when/if a fix will come.

Steve

On Mon, May 7, 2012 at 6:42 PM, Eric Sammer <[email protected]> wrote:
> Jay:
>
> It's unnecessary to ensure a client maintains a ZK connection. A heartbeat
> mechanism is baked into the ZK session semantics. In other words, there's no
> such thing as disconnecting from ZK due to inactivity since, in many
> coordination algorithms, liveness (i.e. mere presence) is required for
> correct functionality. You can prove this to yourself by reading
> through http://zookeeper.apache.org/doc/trunk/zookeeperProgrammers.html#ch_zkSessions
>
> ...although the following paragraph is what you're looking for:
>
> "The session is kept alive by requests sent by the client. If the session is
> idle for a period of time that would timeout the session, the client will
> send a PING request to keep the session alive. This PING request not only
> allows the ZooKeeper server to know that the client is still active, but it
> also allows the client to verify that its connection to the ZooKeeper server
> is still active. The timing of the PING is conservative enough to ensure
> reasonable time to detect a dead connection and reconnect to a new server."
>
> Specifically, this bug is real, but not caused by idle disconnects. It would
> be an error to attempt to "manage" the ZK session. You're not even supposed
> to handle reconnects yourself with ZK (because of the herd effect); ZK
> handles this by internally managing retries and then, upon successfully
> reestablishing the connection, deciding if you are expired.
>
> On Mon, May 7, 2012 at 3:03 PM, Jay Stricks <[email protected]> wrote:
>>
>> I'm wondering how people ensure that their masters stay connected to the
>> ZooKeeper server during long periods of time when no config changes are
>> made. I'm referring specifically to the issues raised in FLUME-60
>> (https://issues.apache.org/jira/browse/FLUME-60):
>>
>> This seems related to long pauses or breakpoints. Disconnecting from ZK is
>> probably reasonable in these conditions, but ideally the connection should
>> be recovered.
>>
>> As an example, after a long pause, a command that modifies ZK state has
>> this error message:
>>
>> Not connected to ZooKeeper: CLOSED
>>
>>
>> I'm trying to think of possible solutions that don't require restarting
>> the master. One idea is to have a test agent periodically issue
>> configuration statements to each master, but are there any other ideas out
>> there?
>>
>> Thanks,
>>
>> Jay
>
>
>
>
> --
> Eric Sammer
> twitter: esammer
> data: www.cloudera.com
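
For what it's worth, here is a rough sketch of the kind of recovery I'd hope for instead of a restart. Eric's point is that ZK itself handles reconnects and only afterwards decides whether the session has expired; once it has expired (or the handle reports CLOSED, as in the error above), the old handle cannot be reused and the application has to create a new one. With the real Java client that would mean reacting to the Watcher.Event.KeeperState.Expired event by constructing a new ZooKeeper instance. The code below does not use the ZooKeeper library or any Flume code: ZkHandle, RecoveringZkClient and the factory are hypothetical names I made up purely to illustrate the idea, and I have no idea how the master is actually wired internally.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for a ZooKeeper client handle (NOT the real ZooKeeper or Flume API).
interface ZkHandle {
    boolean isAlive();                      // false once the session is expired or closed
    void write(String path, String data);   // a command that modifies ZK state
    void close();
}

// Hypothetical wrapper: replaces a dead handle rather than failing until the process is bounced.
final class RecoveringZkClient {
    private final Supplier<ZkHandle> factory; // assumption: builds a fresh handle (new session)
    private ZkHandle handle;
    private int rebuilds = 0;

    RecoveringZkClient(Supplier<ZkHandle> factory) {
        this.factory = factory;
        this.handle = factory.get();
    }

    // Discard the dead handle and open a new one. This is not a reconnect:
    // reconnects within a live session are left to the client library.
    private void rebuild() {
        handle.close();
        handle = factory.get();
        rebuilds++;
    }

    // Intended to be called from a session watcher when expiry is reported.
    synchronized void onSessionExpired() {
        rebuild();
    }

    // Also check lazily before each write, in case the expiry event was missed.
    synchronized void write(String path, String data) {
        if (!handle.isAlive()) {
            rebuild();
        }
        handle.write(path, data);
    }

    synchronized int rebuilds() {
        return rebuilds;
    }
}
```

Note that a new session would not bring back ephemeral nodes or watches from the old one, so a real fix would have to re-register those after rebuilding the handle.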
