[ 
https://issues.apache.org/jira/browse/KAFKA-7941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16899851#comment-16899851
 ] 

Ivan Yurchenko commented on KAFKA-7941:
---------------------------------------

We were hit by this issue as well. In our case it leaves the Connect cluster 
completely inoperable: the {{WorkerCoordinator}} cannot create assignments 
because it cannot read the latest connector config from Kafka, and the 
coordinators get stuck in an infinite loop of
{noformat}
INFO [Worker clientId=connect-1, groupId=connect] Was selected to perform 
assignments, but do not have latest config found in sync request. Returning an 
empty configuration to trigger re-sync. 
(org.apache.kafka.connect.runtime.distributed.WorkerCoordinator:208)
INFO [GroupCoordinator 3]: Assignment received from leader for group connect 
for generation 436 (kafka.coordinator.group.GroupCoordinator)
INFO [Worker clientId=connect-1, groupId=connect] Successfully joined group 
with generation 436 
(org.apache.kafka.clients.consumer.internals.AbstractCoordinator:455)
INFO Joined group and got assignment: Assignment{error=1, 
leader='connect-1-caf0b504-cb29-4456-a28d-3172cdf67d73', 
leaderUrl='http://test-xps7h6wknyd-3.aiven.local:8083/', offset=1, 
connectorIds=[], taskIds=[]} 
(org.apache.kafka.connect.runtime.distributed.DistributedHerder:1216)
INFO [Worker clientId=connect-1, groupId=connect] (Re-)joining group 
(org.apache.kafka.clients.consumer.internals.AbstractCoordinator:491)
INFO [GroupCoordinator 3]: Preparing to rebalance group connect in state 
PreparingRebalance with old generation 436 (__consumer_offsets-30) (reason: 
Updating metadata for member connect-1-caf0b504-cb29-4456-a28d-3172cdf67d73) 
(kafka.coordinator.group.GroupCoordinator)
INFO [GroupCoordinator 3]: Stabilized group connect generation 437 
(__consumer_offsets-30) (kafka.coordinator.group.GroupCoordinator)
{noformat}
Thank you for reporting and fixing this, [~pgwhalen].

> Connect KafkaBasedLog work thread terminates when getting offsets fails 
> because broker is unavailable
> -----------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-7941
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7941
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 2.0.0
>            Reporter: Paul Whalen
>            Assignee: Paul Whalen
>            Priority: Minor
>
> My team has run into this Connect bug regularly in the last six months while 
> doing infrastructure maintenance that causes intermittent broker availability 
> issues.  I'm a little surprised it exists given how routinely it affects us, 
> so perhaps someone in the know can point out if our setup is somehow just 
> incorrect.  My team is running 2.0.0 on both the broker and client, though 
> from what I can tell from reading the code, the issue continues to exist 
> through 2.2; at least, I was able to write a failing unit test that I believe 
> reproduces it.
> When a {{KafkaBasedLog}} worker thread in the Connect runtime calls 
> {{readLogToEnd}} and brokers are unavailable, the {{TimeoutException}} from 
> the consumer {{endOffsets}} call propagates uncaught all the way up to the 
> top-level {{catch (Throwable t)}}, effectively killing the thread until 
> Connect is restarted.  The result is that Connect stops functioning entirely, 
> with no indication except for that log line - tasks still show as running.
> The proposed fix is to simply catch and log the {{TimeoutException}}, 
> allowing the worker thread to retry forever.
> Alternatively, perhaps there is not an expectation that Connect should be 
> able to recover following broker unavailability, though that would be 
> disappointing.  I would at least hope for a louder failure than the 
> single {{ERROR}} log.
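
For readers following along, the fix proposed in the description above (catch 
and log the timeout, then let the worker thread try again) can be sketched 
roughly as below. This is only an illustration and not the actual patch: 
{{BrokerTimeoutException}} is a local stand-in for Kafka's unchecked 
{{org.apache.kafka.common.errors.TimeoutException}}, and 
{{RetryingLogReader}}, {{readToEndWithRetry}}, {{backoffMs}} and the 
in-memory log list are hypothetical names chosen so the example runs without 
Kafka on the classpath.

```java
import java.util.List;

// Minimal sketch, assuming the read-to-end step can simply be re-run after a timeout.
class RetryingLogReader {

    /** Stand-in for Kafka's unchecked TimeoutException (assumption for this sketch). */
    static class BrokerTimeoutException extends RuntimeException {
        BrokerTimeoutException(String message) {
            super(message);
        }
    }

    /**
     * Runs readLogToEnd until it completes without a timeout.
     * Each timeout is recorded in log and followed by a short pause,
     * so the calling thread stays alive instead of dying on the first failure.
     * Returns the number of attempts that were needed.
     */
    static int readToEndWithRetry(Runnable readLogToEnd, long backoffMs, List<String> log) {
        int attempts = 0;
        while (true) {
            attempts++;
            try {
                readLogToEnd.run();
                return attempts;
            } catch (BrokerTimeoutException e) {
                // Log and retry rather than letting the exception escape the loop.
                log.add("WARN timeout while reading log to end, will retry: " + e.getMessage());
                try {
                    Thread.sleep(backoffMs);
                } catch (InterruptedException ie) {
                    // Preserve the interrupt flag and stop retrying if asked to shut down.
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting to retry", ie);
                }
            }
        }
    }
}
```

Only the timeout type is caught here, so any other failure still surfaces to 
the caller. Whether an unbounded retry is acceptable, and how loudly each 
failure should be logged, is the open question raised in the description.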



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
