nvollmar opened a new issue, #1182: URL: https://github.com/apache/incubator-pekko/issues/1182
I uncovered this while investigating cluster issues in our nightly deployment test. Since we started using a low-power CPU governor at night, the Pekko cluster has had trouble forming during the nightly deployment.

I've tracked it down to the `TcpDnsClient` / `TcpConnection` initialization timing out, which leaves the client in a state it cannot recover from, so it never responds to any requests.

The `TcpOutgoingConnection` connects and responds with a `Tcp.Connected` message to the `TcpDnsClient`, which in turn registers itself with the connection: https://github.com/apache/incubator-pekko/blob/46e60a61fbabce5e3f36a408bfa3d1fb249eef44/actor/src/main/scala/org/apache/pekko/io/dns/internal/TcpDnsClient.scala#L52

If that registration message arrives late, the `TcpOutgoingConnection` stops itself, and `TcpDnsClient` has no detection or handling for this case: https://github.com/apache/incubator-pekko/blob/46e60a61fbabce5e3f36a408bfa3d1fb249eef44/actor/src/main/scala/org/apache/pekko/io/TcpConnection.scala#L104

This is a very unusual case, but it happens on almost every deployment for one or two pods when the system is in low-power mode.

Proposed fix: `TcpDnsClient` must watch the connection and fail when it terminates, so that it is re-initialized (it is already managed by a backoff supervisor).
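
A minimal sketch of the proposed fix, assuming a simplified client actor rather than the real `TcpDnsClient`: the names `WatchingTcpClient` and `ConnectionTerminatedException` are hypothetical and exist only for illustration. The idea is to watch the connection actor before sending `Tcp.Register`, and to throw when `Terminated` arrives so that the supervisor restarts the client.

```scala
import java.net.InetSocketAddress

import org.apache.pekko.actor.{ Actor, ActorLogging, ActorRef, Terminated }
import org.apache.pekko.io.{ IO, Tcp }

// Hypothetical exception type, introduced for this sketch only.
final class ConnectionTerminatedException(msg: String) extends RuntimeException(msg)

// Hypothetical, simplified stand-in for TcpDnsClient; not the actual Pekko source.
class WatchingTcpClient(remote: InetSocketAddress) extends Actor with ActorLogging {
  import context.system // implicitly used by IO(Tcp)

  // Runs on first start and, by default, again after every restart.
  override def preStart(): Unit =
    IO(Tcp) ! Tcp.Connect(remote)

  def receive: Receive = connecting

  private def connecting: Receive = {
    case Tcp.CommandFailed(_: Tcp.Connect) =>
      throw new ConnectionTerminatedException(s"Failed to connect to $remote")

    case _: Tcp.Connected =>
      val connection = sender()
      // Watch before registering: if the connection actor has already stopped
      // (for example because it gave up waiting for Register), Terminated is
      // still delivered to this actor.
      context.watch(connection)
      connection ! Tcp.Register(self)
      context.become(ready(connection))
  }

  private def ready(connection: ActorRef): Receive = {
    case Terminated(`connection`) =>
      // Fail so that the parent supervisor restarts this actor, which reconnects.
      throw new ConnectionTerminatedException(s"TCP connection to $remote terminated")

    case Tcp.Received(data) =>
      log.debug("Received {} bytes", data.length)
  }
}
```

Throwing on `Terminated` relies on the parent's supervision strategy restarting the actor; with a backoff supervisor as the parent, as the issue describes, reconnect attempts would be spaced out rather than retried immediately.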
