gharris1727 commented on code in PR #15305:
URL: https://github.com/apache/kafka/pull/15305#discussion_r1591278453


##########
connect/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/WorkerCoordinator.java:
##########
@@ -267,6 +267,18 @@ public String memberId() {
         return JoinGroupRequest.UNKNOWN_MEMBER_ID;
     }
 
+    @Override
+    protected void handlePollTimeoutExpiry() {
+        log.warn("worker poll timeout has expired. This means the time between 
subsequent calls to poll() " +
+            "in DistributedHerder tick() method was longer than the configured 
rebalance.timeout.ms. " +
+            "If you see this happening consistently, then it can be addressed 
by either adding more workers " +
+            "to the connect cluster or by increasing the rebalance.timeout.ms 
configuration value. Please note that " +

Review Comment:
   I think this is decent advice when requests are small and can be distributed around the cluster, but since REST requests are rather infrequent, I think that is the minority of cases.
   
   I think this timeout will most often be triggered by an excessively slow connector start, stop, or validation. In those cases, adding more workers does nothing but move the error to a different worker. I think we can keep the "adding more workers" advice if we also include advice for debugging operations that block the tick thread for too long; without that second piece of advice, telling users to add workers is misleading. A sketch of what that combined advice might look like follows.
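   As a purely illustrative sketch of that combined advice (the wording below is hypothetical and not taken from the PR), the warning might name slow connector operations as a likely cause and suggest a thread dump before recommending more workers:
   
   ```java
   // Hypothetical sketch of an expanded warning; the real method body and wording may differ.
   @Override
   protected void handlePollTimeoutExpiry() {
       log.warn("worker poll timeout has expired. This means the time between subsequent calls to poll() " +
           "in the DistributedHerder tick() method was longer than the configured rebalance.timeout.ms. " +
           "This is often caused by a connector start, stop, or validation call that blocks the tick thread; " +
           "a thread dump of this worker will show what the tick thread is currently doing. If the cluster " +
           "is simply overloaded with work, adding more workers or increasing rebalance.timeout.ms may also help.");
   }
   ```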



##########
connect/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/WorkerCoordinator.java:
##########
@@ -267,6 +267,18 @@ public String memberId() {
         return JoinGroupRequest.UNKNOWN_MEMBER_ID;
     }
 
+    @Override
+    protected void handlePollTimeoutExpiry() {

Review Comment:
   Since we (as maintainers) don't have good insight into what commonly causes the herder tick thread to block and the poll timeout to fire, we recently added https://issues.apache.org/jira/browse/KAFKA-15563 to help users debug it themselves.
   
   It would be nice to integrate with that system so that the heartbeat thread reports what the herder tick thread was blocked on at the moment the poll timeout expired, since that would also surface stalls that aren't caused by REST requests.
   
   The integration is tricky, though, because the WorkerCoordinator is (and should be) unaware of the DistributedHerder, and I think the WorkerCoordinator currently hides these internal disconnects and reconnects inside the poll method. Perhaps we can extend the WorkerRebalanceListener, or add a new error listener, so the herder can be informed about these errors; see the sketch below.
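   As a purely hypothetical sketch of that integration (none of the names below exist in Connect today), the coordinator could fire a new listener callback from handlePollTimeoutExpiry(), and the herder-side handler could log whatever tick-thread stage it last recorded for KAFKA-15563. The callback could equally be added to WorkerRebalanceListener instead of a separate interface:
   
   ```java
   import org.slf4j.Logger;
   import org.slf4j.LoggerFactory;
   
   // Hypothetical sketch only: a listener-based way for the WorkerCoordinator to report a poll
   // timeout expiry without knowing about the DistributedHerder. These types and methods are
   // illustrative and are not part of the existing Connect API.
   interface PollTimeoutListenerSketch {
       // Would be invoked from WorkerCoordinator.handlePollTimeoutExpiry() on the heartbeat thread.
       void onPollTimeoutExpiry();
   }
   
   class HerderSidePollTimeoutHandlerSketch implements PollTimeoutListenerSketch {
       private static final Logger log = LoggerFactory.getLogger(HerderSidePollTimeoutHandlerSketch.class);
   
       // Stand-in for the tick-thread stage tracking added for KAFKA-15563; the tick thread would
       // update this as it moves through connector starts, stops, validations, etc.
       private volatile String tickThreadStage = "idle";
   
       void recordTickThreadStage(String stage) {
           this.tickThreadStage = stage;
       }
   
       @Override
       public void onPollTimeoutExpiry() {
           // Runs on the heartbeat thread, so it can still report while the tick thread is blocked.
           log.warn("Worker poll timeout expired while the tick thread was: {}", tickThreadStage);
       }
   }
   ```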



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
