[ 
https://issues.apache.org/jira/browse/KAFKA-7941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16770714#comment-16770714
 ] 

ASF GitHub Bot commented on KAFKA-7941:
---------------------------------------

pgwhalen commented on pull request #6283: KAFKA-7941: Catch TimeoutException in 
KafkaBasedLog worker thread
URL: https://github.com/apache/kafka/pull/6283
 
 
    - When calling readLogToEnd(), the KafkaBasedLog worker thread should
    catch TimeoutException, which can occur if brokers are unavailable,
    and log a warning; otherwise the worker thread terminates.
    - Includes an enhancement to MockConsumer that allows simulating
    exceptions not just when polling but also when querying for offsets,
    which is necessary for testing the fix.
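
The second bullet can be illustrated with a minimal sketch of a mock that lets a test inject a failure into an offsets query. This is not the actual MockConsumer code from the pull request: `OffsetsMockSketch`, its method names, the use of plain strings for partitions, and `SketchTimeoutException` (a stand-in for Kafka's unchecked `org.apache.kafka.common.errors.TimeoutException`) are all assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Stand-in for Kafka's TimeoutException (hypothetical; defined here so the sketch has no dependencies).
class SketchTimeoutException extends RuntimeException {
    SketchTimeoutException(String message) {
        super(message);
    }
}

// Hypothetical mock, not Kafka's MockConsumer: partitions are plain strings to keep it small.
class OffsetsMockSketch {
    private final Map<String, Long> endOffsets = new HashMap<>();
    private final Queue<RuntimeException> offsetsExceptions = new ArrayDeque<>();

    // Record the end offset the mock should report for a partition.
    void updateEndOffset(String partition, long offset) {
        endOffsets.put(partition, offset);
    }

    // Queue an exception; the next offsets query throws it instead of returning.
    void setOffsetsException(RuntimeException exception) {
        offsetsExceptions.add(exception);
    }

    // Throws a queued exception if one is pending, otherwise returns the recorded end offsets.
    Map<String, Long> endOffsets(Collection<String> partitions) {
        RuntimeException pending = offsetsExceptions.poll();
        if (pending != null) {
            throw pending;
        }
        Map<String, Long> result = new HashMap<>();
        for (String partition : partitions) {
            Long offset = endOffsets.get(partition);
            if (offset == null) {
                throw new IllegalStateException("No end offset set for " + partition);
            }
            result.put(partition, offset);
        }
        return result;
    }
}
```

Because each queued exception is consumed by exactly one query, a test can simulate a single broker outage and then check that the caller recovers on the next attempt.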
   
   ### Committer Checklist (excluded from commit message)
   - [ ] Verify design and implementation 
   - [ ] Verify test coverage and CI build status
   - [ ] Verify documentation (including upgrade notes)
   
 


> Connect KafkaBasedLog work thread terminates when getting offsets fails 
> because broker is unavailable
> -----------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-7941
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7941
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 2.0.0
>            Reporter: Paul Whalen
>            Assignee: Paul Whalen
>            Priority: Minor
>
> My team has run into this Connect bug regularly in the last six months while 
> doing infrastructure maintenance that causes intermittent broker availability 
> issues.  I'm a little surprised it exists given how routinely it affects us, 
> so perhaps someone in the know can point out if our setup is somehow just 
> incorrect.  My team is running 2.0.0 on both the broker and client, though 
> from what I can tell from reading the code, the issue continues to exist 
> through 2.2; at least, I was able to write a failing unit test that I believe 
> reproduces it.
>
> When a {{KafkaBasedLog}} worker thread in the Connect runtime calls 
> {{readLogToEnd}} and brokers are unavailable, the {{TimeoutException}} from 
> the consumer {{endOffsets}} call goes uncaught all the way up to the top-level 
> {{catch (Throwable t)}}, effectively killing the thread until Connect is 
> restarted.  The result is that Connect stops functioning entirely, with no 
> indication other than a single {{ERROR}} log line; tasks still show as running.
>
> The proposed fix is to simply catch and log the {{TimeoutException}}, 
> allowing the worker thread to retry forever.
>
> Alternatively, perhaps there is no expectation that Connect should be 
> able to recover following broker unavailability, though that would be 
> disappointing.  I would at least hope for a louder failure than the 
> single {{ERROR}} log.
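
The fix proposed in the issue can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the actual KafkaBasedLog code: `WorkerLoopSketch`, the `maxIterations` bound (the real worker thread would keep looping), the in-memory `warnings` list standing in for a logger, and `StandInTimeoutException` (in place of Kafka's unchecked `org.apache.kafka.common.errors.TimeoutException`) are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongSupplier;

// Stand-in for Kafka's TimeoutException (hypothetical; keeps the sketch dependency-free).
class StandInTimeoutException extends RuntimeException {
    StandInTimeoutException(String message) {
        super(message);
    }
}

// Hypothetical worker loop: a timeout is logged and retried instead of escaping the loop.
class WorkerLoopSketch {
    // Collected warning messages; a real implementation would use a logger instead.
    final List<String> warnings = new ArrayList<>();

    // Calls readLogToEnd up to maxIterations times and returns the first successful
    // result, or -1 if every attempt timed out. The bound exists only so the sketch
    // terminates; it is not part of the proposed fix.
    long run(LongSupplier readLogToEnd, int maxIterations) {
        for (int attempt = 0; attempt < maxIterations; attempt++) {
            try {
                return readLogToEnd.getAsLong();
            } catch (StandInTimeoutException e) {
                // Without this catch the exception would reach a top-level
                // catch (Throwable t) and the thread would stop for good.
                warnings.add("Timeout while reading log to end, will retry: " + e.getMessage());
            }
        }
        return -1L;
    }
}
```

Catching only the timeout keeps other, unexpected failures visible at the top level, while a temporarily unreachable broker just produces a warning and another attempt.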



