hzhaop commented on PR #777:
URL: https://github.com/apache/skywalking-java/pull/777#issuecomment-3442077021
The scenario you mentioned, where the agent quickly reconnects after a
server reboot, typically occurs when the server shuts down cleanly, allowing
TCP connections to terminate properly.
However, the problem we encountered arises primarily in unstable network
environments, where TCP connections can end up in a half-open state. In such
situations:
1. The server-side connection is terminated, but the client still
believes the connection is alive. This causes the client's send-Q to
continuously accumulate data, and the agent remains unaware that the connection
has become invalid, thus not triggering an automatic reconnection.
2. The role of gRPC keepalive: The purpose of introducing gRPC keepalive
is precisely to actively detect these half-open connections. By periodically
sending heartbeats, the agent can promptly discover connections that are
actually dead but that the client still perceives as alive, then force them
closed and start the reconnection process.
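To make the detection step concrete, here is a minimal sketch of the idea in
plain Java. It is a simplified model, not gRPC's internal implementation: the
class `KeepaliveMonitor` and all of its method names are hypothetical, and time
is passed in explicitly so the logic is easy to follow. In grpc-java the
corresponding settings are `keepAliveTime` and `keepAliveTimeout` on
`ManagedChannelBuilder`.

```java
// Simplified model of keepalive-based detection of a half-open connection.
// All names are hypothetical; this is not how gRPC implements it internally.
final class KeepaliveMonitor {
    private final long intervalMillis; // idle time after which a ping is sent
    private final long timeoutMillis;  // how long to wait for the ping ack
    private long lastActivityMillis;   // last time anything arrived from the peer
    private long pingSentMillis = -1;  // -1 means no ping is outstanding

    KeepaliveMonitor(long intervalMillis, long timeoutMillis, long nowMillis) {
        this.intervalMillis = intervalMillis;
        this.timeoutMillis = timeoutMillis;
        this.lastActivityMillis = nowMillis;
    }

    /** Called whenever data or a ping ack arrives from the peer. */
    void onPeerActivity(long nowMillis) {
        lastActivityMillis = nowMillis;
        pingSentMillis = -1;
    }

    /** Returns true if a ping should be sent now, and records it as outstanding. */
    boolean shouldSendPing(long nowMillis) {
        if (pingSentMillis < 0 && nowMillis - lastActivityMillis >= intervalMillis) {
            pingSentMillis = nowMillis;
            return true;
        }
        return false;
    }

    /** True once an outstanding ping has gone unanswered for the full timeout. */
    boolean isConnectionDead(long nowMillis) {
        return pingSentMillis >= 0 && nowMillis - pingSentMillis >= timeoutMillis;
    }
}
```

On a half-open connection the peer never answers, so `onPeerActivity` is never
called and `isConnectionDead` eventually returns true. Without the ping, the
client would only keep filling its send-Q.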
Regarding your point, "If nothing changed, there is no point to create a
new channel":
* Change in connection state: Even if the target backend address remains
unchanged, the internal state of the previous connection is corrupted because
it is half-open. In this scenario, simply reusing the old channel is
ineffective, as it cannot recover.
* Necessity of forced reconnection: We observed that after keepalive
detected a connection failure, if the agent subsequently selected the same
backend, the original reconnection logic would not immediately force the
establishment of a new channel. Instead, it would wait for a long period
(approximately one hour) before attempting to reconnect. The reconnection
logic is therefore modified so that, upon detecting a connection failure, the
old channel is forcibly closed and a new `channel` is established, regardless
of whether the same backend is selected. This is crucial for ensuring timely
connection recovery and preventing prolonged service interruptions.
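
The difference between the two behaviours can be sketched as a pair of decision
functions. This is a simplified model of what is described above, not the
agent's actual code: `ReconnectPolicy`, both method names, and the backend
addresses in the usage note are hypothetical.

```java
// Simplified model of the channel-rebuild decision. Names are hypothetical
// and do not correspond to the agent's real classes.
final class ReconnectPolicy {
    /**
     * Original behaviour as described above: after a failure, a new channel
     * is built immediately only when a different backend was selected.
     */
    static boolean rebuildOriginal(String currentBackend, String selectedBackend,
                                   boolean connectionFailed) {
        return connectionFailed && !currentBackend.equals(selectedBackend);
    }

    /**
     * Proposed behaviour: any detected failure forces the old channel to be
     * closed and a new one built, even if the same backend is selected.
     * The backend arguments are kept only to mirror the signature above.
     */
    static boolean rebuildProposed(String currentBackend, String selectedBackend,
                                   boolean connectionFailed) {
        return connectionFailed;
    }
}
```

With a single backend such as `oap-1:11800`, `rebuildOriginal("oap-1:11800",
"oap-1:11800", true)` is false, so the half-open channel is kept, while
`rebuildProposed` with the same arguments is true.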
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]