Kezhu, thanks for the great investigation! I am also +1 for the second approach. This seems like the right balance of resilience and compatibility with established applications.
Chris Nauroth On Thu, Apr 24, 2025 at 8:07 AM Kezhu Wang <kez...@apache.org> wrote: > Hi all, > > Before ZOOKEEPER-4508[1], clients could run into endless session > reestablishment if zk cluster is unavailable. > > 1. For established sessions, the client tries to reestablish the > session endlessly. Thus, the client will not get `Expired` until the > cluster comes back. > 2. In the case of no established session, the client tries to > establish a brand-new session endlessly. > > I authored pr-2058[2] to fix them both by issuing `Expired` after > approximately 4/3 * sessionTimeout as I think: > 1. The first is a bug from clients' perspective. `Expired` deserves > prompt awareness. > 2. Clients should be aware of failure after a period of time, so as > not to block clients indefinitely. > > In investigating ZOOKEEPER-4921[3], I found that Twitter Finagle[4] > probably depends on the endless retry to establish a brand-new > session. > > I think we have roads to go: > 1. Treat endless retry in case of unestablished session as a bug. This > will require third-party applications to fix themselves once they > depend on the endless retry. I could author a fix for Finagle. But I > doubt whether this endless retry is an already established feature for > some third libraries. > 2. Rollback to endless retry in 3.9.4. And provide customized > `newSessionTimeout` to control timeout to establish a brand-new > session through `ZooKeeperBuilder`[5] in 3.10. Together with > ZOOKEEPER-4508, we will get `Expired` for established sessions and > gain timeout control in establishing brand-new sessions without > breaking possible third party applications. > > I am kindly leaning towards the second approach. > > What do you think ? > > Best, > Kezhu > > [1]: https://issues.apache.org/jira/browse/ZOOKEEPER-4508 > [2]: https://github.com/apache/zookeeper/pull/2058 > [3]: https://issues.apache.org/jira/browse/ZOOKEEPER-4921 > [4]: > https://github.com/twitter/finagle/blob/develop/finagle-serversets/src/main/java/com/twitter/finagle/common/zookeeper/ZooKeeperClient.java#L357 > [5]: https://issues.apache.org/jira/browse/ZOOKEEPER-4697 >