gharris1727 commented on code in PR #15305:
URL: https://github.com/apache/kafka/pull/15305#discussion_r1591278453
##########
connect/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/WorkerCoordinator.java:
##########
@@ -267,6 +267,18 @@ public String memberId() {
         return JoinGroupRequest.UNKNOWN_MEMBER_ID;
     }
 
+    @Override
+    protected void handlePollTimeoutExpiry() {
+        log.warn("worker poll timeout has expired. This means the time between subsequent calls to poll() " +
+                "in DistributedHerder tick() method was longer than the configured rebalance.timeout.ms. " +
+                "If you see this happening consistently, then it can be addressed by either adding more workers " +
+                "to the connect cluster or by increasing the rebalance.timeout.ms configuration value. Please note that " +

Review Comment:
I think this is decent advice when requests are small and can be distributed around the cluster, but since REST requests are rather infrequent, I think this is the minority of cases. Most often this timeout is going to be triggered by an excessively slow connector start, stop, or validation. In those cases, adding more workers does nothing but move the error to a different worker.

I think we can keep the "adding more workers" comment if we include another piece of advice for debugging excessively blocking tasks. If we don't have that other piece of advice, then advising users to add workers is misleading.

##########
connect/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/WorkerCoordinator.java:
##########
@@ -267,6 +267,18 @@ public String memberId() {
         return JoinGroupRequest.UNKNOWN_MEMBER_ID;
     }
 
+    @Override
+    protected void handlePollTimeoutExpiry() {

Review Comment:
Since we (as maintainers) don't have good insight into what commonly causes the herder tick thread to block and the poll timeout to fire, we recently added https://issues.apache.org/jira/browse/KAFKA-15563 to help users debug it themselves. It would be nice to integrate with this system so that the heartbeat thread reports what the herder tick thread was blocked on at the time the poll timeout happened, as this would surface stalling that isn't caused by REST requests.

The integration is tricky, though, because the WorkerCoordinator is (and should be) unaware of the DistributedHerder, and currently I think the WorkerCoordinator hides these internal disconnects and reconnects inside the poll method. Perhaps we can extend the WorkerRebalanceListener or add a new error listener to allow the herder to be informed about these errors.
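To make the "error listener" idea a bit more concrete, here is a rough, self-contained sketch. All the names in it (`PollTimeoutListener`, `onPollTimeoutExpiry`, `currentTickStage`, the stub classes) are hypothetical and only illustrate the shape of the wiring; none of them are existing Connect APIs, and the real integration would have to go through the existing plumbing between WorkerCoordinator and DistributedHerder (e.g. the rebalance listener) rather than a direct reference:

```java
// Hypothetical sketch only: none of these names exist in Kafka Connect today.
// It shows one possible shape for the error-listener idea: the coordinator's
// heartbeat thread notifies a listener when the poll timeout expires, and the
// herder side responds by logging whatever stage its tick thread is currently in.

import java.util.concurrent.atomic.AtomicReference;

public class PollTimeoutListenerSketch {

    /** Callback the coordinator side would invoke from handlePollTimeoutExpiry(). */
    interface PollTimeoutListener {
        void onPollTimeoutExpiry();
    }

    /** Stand-in for the coordinator side (the WorkerCoordinator role). */
    static class CoordinatorStub {
        private final PollTimeoutListener listener;

        CoordinatorStub(PollTimeoutListener listener) {
            this.listener = listener;
        }

        // Called from the heartbeat thread when the poll interval is exceeded.
        void handlePollTimeoutExpiry() {
            listener.onPollTimeoutExpiry();
        }
    }

    /** Stand-in for the herder side (the DistributedHerder role), which tracks its tick stage. */
    static class HerderStub implements PollTimeoutListener {
        // The tick thread updates this as it moves between stages
        // (handling REST requests, starting/stopping connectors, reading the config topic, ...).
        private final AtomicReference<String> currentTickStage =
                new AtomicReference<>("polling the group coordinator");

        void enterStage(String stage) {
            currentTickStage.set(stage);
        }

        @Override
        public void onPollTimeoutExpiry() {
            // Runs on the heartbeat thread, so it can report even while the tick thread is stuck.
            System.err.println("Poll timeout expired while the tick thread was: " + currentTickStage.get());
        }
    }

    public static void main(String[] args) {
        HerderStub herder = new HerderStub();
        CoordinatorStub coordinator = new CoordinatorStub(herder);

        herder.enterStage("starting connector some-connector");
        coordinator.handlePollTimeoutExpiry(); // simulate the heartbeat thread firing the timeout
    }
}
```

The point of routing the notification through a listener is that the coordinator stays unaware of the herder, while the herder, which is the side that knows what its tick thread is doing, decides what to report when the timeout fires.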