jasondhoyt opened a new issue, #3064:
URL: https://github.com/apache/helix/issues/3064

   ### Describe the bug
   
   We run a service that uses Helix (1.4.0), and it runs into trouble when the 
network connection of the local Helix agent is interrupted on individual nodes 
within a cluster. The agent processes do not always reconnect, which has caused 
problems for the service as a whole and forced us to take more drastic steps to 
remediate. Ideally the local Helix agent would reconnect on its own, but it 
does not. The exceptions below appear to originate inside the Helix Java 
library. We also monitor the `HelixManager.isConnected()` method, but it does 
not seem to return false in these circumstances. If it did, our local agent 
would restart automatically, which would have resolved the issue. Is there any 
insight into why the Helix library is not reconnecting properly, or why it is 
not reporting the bad connection? Is this a potential bug in the Helix 
library?
   
   ```
   2025-07-22 13:08:37,350 [main-SendThread(10.0.0.1:2181)] WARN 
org.apache.zookeeper.ClientCnxn:1257 - Client session timed out, have not heard 
from server in 40003ms for session id 0x1d0000039b0f05e7
   2025-07-22 13:08:37,353 [main-SendThread(10.0.0.1:2181)] WARN 
org.apache.zookeeper.ClientCnxn:1300 - Session 0x1d0000039b0f05e7 for sever 
10.0.0.1/10.0.0.1:2181, Closing socket connection. Attempting reconnect except 
it is a SessionExpiredException.
   org.apache.zookeeper.ClientCnxn$SessionTimeoutException: Client session 
timed out, have not heard from server in 40003ms for session id 
0x1d0000039b0f05e7
           at 
org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1258) [?:?]
   2025-07-22 13:08:37,472 [main-EventThread] INFO 
org.apache.helix.zookeeper.zkclient.ZkClient:1561 - zkclient 2, zookeeper state 
changed ( Disconnected )
   2025-07-22 13:08:37,483 
[ZkClient-EventThread-47-10.0.0.3:2181,10.0.0.2:2181,10.0.0.4:2181,10.0.0.5:2181,10.0.0.1:2181]
 WARN org.apache.helix.manager.zk.ZKHelixManager:1272 - 
KeeperState:Disconnected, SessionId: 1d0000039b0f05e7, instance: 10.0.0.6_1, 
type: PARTICIPANT
   2025-07-22 13:08:53,160 [main-SendThread(10.0.0.2:2181)] INFO 
org.apache.zookeeper.ClientCnxn:1181 - Opening socket connection to server 
10.0.0.2/10.0.0.2:2181.
   2025-07-22 13:08:53,160 [main-SendThread(10.0.0.2:2181)] INFO 
org.apache.zookeeper.ClientCnxn:1183 - SASL config status: Will not attempt to 
authenticate using SASL (unknown error)
   2025-07-22 13:09:05,176 [main-SendThread(10.0.0.2:2181)] WARN 
org.apache.zookeeper.ClientCnxn:1257 - Client session timed out, have not heard 
from server in 27709ms for session id 0x1d0000039b0f05e7
   2025-07-22 13:09:05,176 [main-SendThread(10.0.0.2:2181)] WARN 
org.apache.zookeeper.ClientCnxn:1300 - Session 0x1d0000039b0f05e7 for sever 
10.0.0.2/10.0.0.2:2181, Closing socket connection. Attempting reconnect except 
it is a SessionExpiredException.
   org.apache.zookeeper.ClientCnxn$SessionTimeoutException: Client session 
timed out, have not heard from server in 27709ms for session id 
0x1d0000039b0f05e7
           at 
org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1258) [?:?]
   2025-07-22 13:09:10,520 [main-SendThread(10.0.0.3:2181)] INFO 
org.apache.zookeeper.ClientCnxn:1181 - Opening socket connection to server 
10.0.0.3/10.0.0.3:2181.
   2025-07-22 13:09:10,522 [main-SendThread(10.0.0.3:2181)] INFO 
org.apache.zookeeper.ClientCnxn:1183 - SASL config status: Will not attempt to 
authenticate using SASL (unknown error)
   2025-07-22 13:09:10,523 [main-SendThread(10.0.0.3:2181)] INFO 
org.apache.zookeeper.ClientCnxn:1013 - Socket connection established, 
initiating session, client: /10.0.0.6:43388, server: 10.0.0.3/10.0.0.3:2181
   2025-07-22 13:09:10,546 [main-EventThread] INFO 
org.apache.helix.zookeeper.zkclient.ZkClient:1561 - zkclient 2, zookeeper state 
changed ( Expired )
   2025-07-22 13:09:10,546 
[ZkClient-EventThread-47-10.0.0.3:2181,10.0.0.2:2181,10.0.0.4:2181,10.0.0.5:2181,10.0.0.1:2181]
 WARN org.apache.helix.manager.zk.ZKHelixManager:1272 - KeeperState:Expired, 
SessionId: 1d0000039b0f05e7, instance: 10.0.0.6_1, type: PARTICIPANT
   2025-07-22 13:09:10,546 [main-SendThread(10.0.0.3:2181)] WARN 
org.apache.zookeeper.ClientCnxn:1433 - Unable to reconnect to ZooKeeper 
service, session 0x1d0000039b0f05e7 has expired
   2025-07-22 13:09:10,546 [main-SendThread(10.0.0.3:2181)] WARN 
org.apache.zookeeper.ClientCnxn:1300 - Session 0x1d0000039b0f05e7 for sever 
10.0.0.3/10.0.0.3:2181, Closing socket connection. Attempting reconnect except 
it is a SessionExpiredException.
   org.apache.zookeeper.ClientCnxn$SessionExpiredException: Unable to reconnect 
to ZooKeeper service, session 0x1d0000039b0f05e7 has expired
           at 
org.apache.zookeeper.ClientCnxn$SendThread.onConnected(ClientCnxn.java:1434) 
~[?:?]
           at 
org.apache.zookeeper.ClientCnxnSocket.readConnectResult(ClientCnxnSocket.java:154)
 ~[?:?]
           at 
org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:86) 
~[?:?]
           at 
org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
 ~[?:?]
           at 
org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1290) [?:?]
   ```
   
   ### To Reproduce
   
   Cause a network interruption between the Helix agent and the ZooKeeper 
service. Helix attempts to reconnect but does not report the connection 
failure back up through `HelixManager`.
   
   ### Expected behavior
   
   The `HelixManager.isConnected()` method should return false when there is a 
connection issue with the ZooKeeper service.
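
A minimal sketch of the polling pattern this report relies on, for illustration only. `ConnectionWatchdog`, the failure threshold, and the restart callback are hypothetical names and not part of Helix; only `HelixManager.isConnected()` comes from the Helix API. The check is passed in as a `BooleanSupplier` (for example `helixManager::isConnected`) so the sketch compiles without a Helix dependency.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

/** Hypothetical watchdog (not a Helix class): polls a connectivity check. */
final class ConnectionWatchdog {
    private final BooleanSupplier isConnected;   // e.g. helixManager::isConnected
    private final int maxConsecutiveFailures;    // assumed threshold before acting
    private final Runnable onUnhealthy;          // e.g. restart the local agent
    private int consecutiveFailures = 0;

    ConnectionWatchdog(BooleanSupplier isConnected,
                       int maxConsecutiveFailures,
                       Runnable onUnhealthy) {
        if (maxConsecutiveFailures < 1) {
            throw new IllegalArgumentException("maxConsecutiveFailures must be >= 1");
        }
        this.isConnected = isConnected;
        this.maxConsecutiveFailures = maxConsecutiveFailures;
        this.onUnhealthy = onUnhealthy;
    }

    /** Runs one probe; returns true if the unhealthy callback was fired. */
    synchronized boolean checkOnce() {
        if (isConnected.getAsBoolean()) {
            consecutiveFailures = 0;             // healthy probe resets the count
            return false;
        }
        consecutiveFailures++;
        if (consecutiveFailures >= maxConsecutiveFailures) {
            consecutiveFailures = 0;             // start counting afresh after acting
            onUnhealthy.run();
            return true;
        }
        return false;
    }

    /** Schedules checkOnce() at a fixed period; the caller shuts the executor down. */
    ScheduledExecutorService start(long periodMillis) {
        ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(
                this::checkOnce, 0, periodMillis, TimeUnit.MILLISECONDS);
        return scheduler;
    }
}
```

Wiring it up might look like `new ConnectionWatchdog(helixManager::isConnected, 3, agent::restart).start(5000)`, where `agent::restart` is a placeholder. A watchdog like this only helps if `isConnected()` actually turns false once the session has expired, which is the behavior in question here.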
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]