I can confirm this is real.  If a Flume master talking to external ZK
machines (three, in my case) is left running idle for a long time, you
will get this error when you come back to the Flume shell to execute a
command.  The only fix seems to be to bounce the Flume master
(CDH3u3).  That seems extreme compared with simply re-establishing an
expired or timed-out ZK connection.
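
To illustrate what I mean by reconnecting: as far as I understand ZK, a
plain disconnect is retried by the client library itself, but once the
session is reported as expired the old handle is dead and the
application has to open a new one.  A rough sketch of that logic is
below.  It uses stand-in types rather than the real ZooKeeper classes
(SessionGuard, State and the handle factory are all made up for
illustration), and it is not how the Flume master is actually written.

```java
import java.util.function.Supplier;

/**
 * Hypothetical sketch: keeps one client handle and replaces it only
 * when the session is reported as expired. H stands in for whatever
 * client handle type the application uses.
 */
public class SessionGuard<H> {

    /** Simplified stand-in for the session states a ZK client reports. */
    public enum State { CONNECTED, DISCONNECTED, EXPIRED }

    // Assumption: the factory opens a brand-new session each time it is called.
    private final Supplier<H> handleFactory;
    private H handle;
    private int reopenCount = 0;

    public SessionGuard(Supplier<H> handleFactory) {
        this.handleFactory = handleFactory;
        this.handle = handleFactory.get();
    }

    /** Called whenever the client reports a session state change. */
    public synchronized void onStateChange(State state) {
        switch (state) {
            case EXPIRED:
                // The old session is gone for good, so open a new handle.
                handle = handleFactory.get();
                reopenCount++;
                break;
            case DISCONNECTED:
                // Do nothing: the client library retries on its own.
                break;
            case CONNECTED:
            default:
                break;
        }
    }

    public synchronized H handle() {
        return handle;
    }

    public synchronized int reopenCount() {
        return reopenCount;
    }
}
```

With something like this in place, a command issued after a long pause
would go through a fresh handle instead of failing on a closed one,
while ordinary disconnects are still left to the client library.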

FLUME-60 appears to be this issue, but it is currently unassigned and
hasn't been updated in some time, so I have no idea when or if a fix
will come.

Steve

On Mon, May 7, 2012 at 6:42 PM, Eric Sammer <[email protected]> wrote:
> Jay:
>
> It's unnecessary to ensure a client maintains a ZK connection. A heartbeat
> mechanism is baked into the ZK session semantics. In other words, there's no
> such thing as disconnecting from ZK due to inactivity since, in many
> coordination algorithms, liveness (i.e. mere presence) is required for
> correct functionality. You can prove this to yourself by reading
> through http://zookeeper.apache.org/doc/trunk/zookeeperProgrammers.html#ch_zkSessions
>
> ...although the following paragraph is what you're looking for:
>
> "The session is kept alive by requests sent by the client. If the session is
> idle for a period of time that would timeout the session, the client will
> send a PING request to keep the session alive. This PING request not only
> allows the ZooKeeper server to know that the client is still active, but it
> also allows the client to verify that its connection to the ZooKeeper server
> is still active. The timing of the PING is conservative enough to ensure
> reasonable time to detect a dead connection and reconnect to a new server."
>
> Specifically, this bug is real, but not caused by idle disconnects. It would
> be an error to attempt to "manage" the ZK session. You're not even supposed
> to handle reconnects yourself with ZK (because of the herd effect); ZK
> handles this by internally managing retries and then, upon successfully
> reestablishing the connection, deciding if you are expired.
>
> On Mon, May 7, 2012 at 3:03 PM, Jay Stricks <[email protected]> wrote:
>>
>> I'm wondering how people ensure that their masters stay connected to the
>> ZooKeeper server during long periods of time when no config changes are
>> made. I'm referring specifically to the issues raised in FLUME-60
>> (https://issues.apache.org/jira/browse/FLUME-60):
>>
>> This seems related to long pauses or breakpoints. Disconnecting from ZK is
>> probably reasonable in these conditions, but ideally the connection should
>> be recovered.
>>
>> As an example, after a long pause, a command that modifies ZK state has
>> this error message:
>>
>> Not connected to ZooKeeper: CLOSED
>>
>>
>> I'm trying to think of possible solutions that don't require restarting
>> the master. One idea is to have a test agent periodically issue
>> configuration statements to each master, but are there any other ideas out
>> there?
>>
>> Thanks,
>>
>> Jay
>
>
>
>
> --
> Eric Sammer
> twitter: esammer
> data: www.cloudera.com
