jasondhoyt opened a new issue, #3064:
URL: https://github.com/apache/helix/issues/3064
### Describe the bug
We have a service that uses Helix (1.4.0), and it runs into trouble when the
network connection of the local Helix agent on an individual cluster node is
interrupted. The agent processes do not always reconnect, which degrades the
service as a whole and forces us to take more drastic remediation steps.
Ideally the local Helix agent would reconnect on its own, but it does not. The
exceptions below appear to originate inside the Helix Java library. We also
monitor `HelixManager.isConnected()`, but it does not seem to return false in
these circumstances. If it did, our local agent would restart automatically,
which would have resolved the issue. Is there any insight into why the Helix
library is neither reconnecting properly nor reporting the bad connection? Is
this a potential bug in the Helix library?
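For context, the monitoring described above follows roughly the pattern
sketched here. This is a simplified illustration, not our production code: the
class name `ConnectionWatchdog`, the restart hook, and the failure threshold
are all hypothetical, and the probe is a plain `BooleanSupplier` that in
practice would be something like `manager::isConnected`. As noted, the problem
we see is that the probe never reports false, so the restart never fires.

```java
import java.util.Objects;
import java.util.function.BooleanSupplier;

/**
 * Polls a connectivity probe and triggers a restart action after several
 * consecutive failed checks. All names here are illustrative assumptions.
 */
final class ConnectionWatchdog {
    private final BooleanSupplier probe;   // e.g. manager::isConnected (assumed wiring)
    private final Runnable restartAction;  // hypothetical hook that restarts the agent
    private final int maxConsecutiveFailures;
    private int failures = 0;

    ConnectionWatchdog(BooleanSupplier probe, Runnable restartAction, int maxConsecutiveFailures) {
        if (maxConsecutiveFailures < 1) {
            throw new IllegalArgumentException("maxConsecutiveFailures must be >= 1");
        }
        this.probe = Objects.requireNonNull(probe);
        this.restartAction = Objects.requireNonNull(restartAction);
        this.maxConsecutiveFailures = maxConsecutiveFailures;
    }

    /** Runs one check; returns true if the restart action was triggered. */
    synchronized boolean checkOnce() {
        if (probe.getAsBoolean()) {
            failures = 0;  // a healthy check resets the failure streak
            return false;
        }
        failures++;
        if (failures >= maxConsecutiveFailures) {
            failures = 0;  // start a fresh streak after restarting
            restartAction.run();
            return true;
        }
        return false;
    }
}
```

In a real agent `checkOnce()` would be called periodically, for example from a
`ScheduledExecutorService`. Requiring several consecutive failures avoids
restarting on a brief blip that the client recovers from on its own.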
```
2025-07-22 13:08:37,350 [main-SendThread(10.0.0.1:2181)] WARN
org.apache.zookeeper.ClientCnxn:1257 - Client session timed out, have not heard
from server in 40003ms for session id 0x1d0000039b0f05e7
2025-07-22 13:08:37,353 [main-SendThread(10.0.0.1:2181)] WARN
org.apache.zookeeper.ClientCnxn:1300 - Session 0x1d0000039b0f05e7 for sever
10.0.0.1/10.0.0.1:2181, Closing socket connection. Attempting reconnect except
it is a SessionExpiredException.
org.apache.zookeeper.ClientCnxn$SessionTimeoutException: Client session
timed out, have not heard from server in 40003ms for session id
0x1d0000039b0f05e7
at
org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1258) [?:?]
2025-07-22 13:08:37,472 [main-EventThread] INFO
org.apache.helix.zookeeper.zkclient.ZkClient:1561 - zkclient 2, zookeeper state
changed ( Disconnected )
2025-07-22 13:08:37,483
[ZkClient-EventThread-47-10.0.0.3:2181,10.0.0.2:2181,10.0.0.4:2181,10.0.0.5:2181,10.0.0.1:2181]
WARN org.apache.helix.manager.zk.ZKHelixManager:1272 -
KeeperState:Disconnected, SessionId: 1d0000039b0f05e7, instance: 10.0.0.6_1,
type: PARTICIPANT
2025-07-22 13:08:53,160 [main-SendThread(10.0.0.2:2181)] INFO
org.apache.zookeeper.ClientCnxn:1181 - Opening socket connection to server
10.0.0.2/10.0.0.2:2181.
2025-07-22 13:08:53,160 [main-SendThread(10.0.0.2:2181)] INFO
org.apache.zookeeper.ClientCnxn:1183 - SASL config status: Will not attempt to
authenticate using SASL (unknown error)
2025-07-22 13:09:05,176 [main-SendThread(10.0.0.2:2181)] WARN
org.apache.zookeeper.ClientCnxn:1257 - Client session timed out, have not heard
from server in 27709ms for session id 0x1d0000039b0f05e7
2025-07-22 13:09:05,176 [main-SendThread(10.0.0.2:2181)] WARN
org.apache.zookeeper.ClientCnxn:1300 - Session 0x1d0000039b0f05e7 for sever
10.0.0.2/10.0.0.2:2181, Closing socket connection. Attempting reconnect except
it is a SessionExpiredException.
org.apache.zookeeper.ClientCnxn$SessionTimeoutException: Client session
timed out, have not heard from server in 27709ms for session id
0x1d0000039b0f05e7
at
org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1258) [?:?]
2025-07-22 13:09:10,520 [main-SendThread(10.0.0.3:2181)] INFO
org.apache.zookeeper.ClientCnxn:1181 - Opening socket connection to server
10.0.0.3/10.0.0.3:2181.
2025-07-22 13:09:10,522 [main-SendThread(10.0.0.3:2181)] INFO
org.apache.zookeeper.ClientCnxn:1183 - SASL config status: Will not attempt to
authenticate using SASL (unknown error)
2025-07-22 13:09:10,523 [main-SendThread(10.0.0.3:2181)] INFO
org.apache.zookeeper.ClientCnxn:1013 - Socket connection established,
initiating session, client: /10.0.0.6:43388, server: 10.0.0.3/10.0.0.3:2181
2025-07-22 13:09:10,546 [main-EventThread] INFO
org.apache.helix.zookeeper.zkclient.ZkClient:1561 - zkclient 2, zookeeper state
changed ( Expired )
2025-07-22 13:09:10,546
[ZkClient-EventThread-47-10.0.0.3:2181,10.0.0.2:2181,10.0.0.4:2181,10.0.0.5:2181,10.0.0.1:2181]
WARN org.apache.helix.manager.zk.ZKHelixManager:1272 - KeeperState:Expired,
SessionId: 1d0000039b0f05e7, instance: 10.0.0.6_1, type: PARTICIPANT
2025-07-22 13:09:10,546 [main-SendThread(10.0.0.3:2181)] WARN
org.apache.zookeeper.ClientCnxn:1433 - Unable to reconnect to ZooKeeper
service, session 0x1d0000039b0f05e7 has expired
2025-07-22 13:09:10,546 [main-SendThread(10.0.0.3:2181)] WARN
org.apache.zookeeper.ClientCnxn:1300 - Session 0x1d0000039b0f05e7 for sever
10.0.0.3/10.0.0.3:2181, Closing socket connection. Attempting reconnect except
it is a SessionExpiredException.
org.apache.zookeeper.ClientCnxn$SessionExpiredException: Unable to reconnect
to ZooKeeper service, session 0x1d0000039b0f05e7 has expired
at
org.apache.zookeeper.ClientCnxn$SendThread.onConnected(ClientCnxn.java:1434)
~[?:?]
at
org.apache.zookeeper.ClientCnxnSocket.readConnectResult(ClientCnxnSocket.java:154)
~[?:?]
at
org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:86)
~[?:?]
at
org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
~[?:?]
at
org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1290) [?:?]
```
### To Reproduce
Cause a network interruption between the Helix agent and the ZooKeeper
service. Helix attempts to reconnect but does not report the connection
failure back up through `HelixManager`.
### Expected behavior
`HelixManager.isConnected()` should return false whenever there is a
connection issue with the ZooKeeper service.
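To make the expectation concrete, here is a minimal sketch of the semantics we
have in mind, driven by the state transitions visible in the log above
(`Disconnected`, then `Expired`). The `SessionStateTracker` class and its
`State` enum are hypothetical and only mirror those log states; this is not a
claim about how Helix tracks connection state internally.

```java
import java.util.Objects;
import java.util.concurrent.atomic.AtomicReference;

/**
 * Tracks the most recent session state and reports "connected" only while
 * that state is CONNECTED. Names are illustrative, not Helix APIs.
 */
final class SessionStateTracker {
    /** Hypothetical states mirroring the KeeperState values seen in the log. */
    enum State { CONNECTED, DISCONNECTED, EXPIRED }

    // Start pessimistic: not connected until a CONNECTED event arrives.
    private final AtomicReference<State> state = new AtomicReference<>(State.DISCONNECTED);

    /** Records a state-change event (assumed to be fed from a state listener). */
    void onStateChanged(State newState) {
        state.set(Objects.requireNonNull(newState));
    }

    /** Expected behavior: false for both DISCONNECTED and EXPIRED. */
    boolean isConnected() {
        return state.get() == State.CONNECTED;
    }
}
```

Replaying the sequence from the log, `isConnected()` would report false from
the `Disconnected` event at 13:08:37 onward and would stay false through the
`Expired` event at 13:09:10 until a new session is established.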