[ https://issues.apache.org/jira/browse/KAFKA-6593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379070#comment-16379070 ]
ASF GitHub Bot commented on KAFKA-6593:
---------------------------------------

hachikuji opened a new pull request #4625: KAFKA-6593 [WIP]; Fix livelock with consumer heartbeat thread in commitSync
URL: https://github.com/apache/kafka/pull/4625

   Contention for the lock in ConsumerNetworkClient can lead to a livelock situation in which an active commitSync is unable to make progress because its completion is blocked in the heartbeat thread. The fix is twofold:

   1) We change ConsumerNetworkClient to use a fair lock to reduce the chance of each thread getting starved.
   2) We eliminate the dependence on the lock in ConsumerNetworkClient for callback completion so that callbacks will not be blocked by an active poll().

   I've left this as a WIP patch since I am still considering test cases.

   ### Committer Checklist (excluded from commit message)
   - [ ] Verify design and implementation
   - [ ] Verify test coverage and CI build status
   - [ ] Verify documentation (including upgrade notes)

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

> Coordinator disconnect in heartbeat thread can cause commitSync to block
> indefinitely
> -------------------------------------------------------------------------------------
>
>                 Key: KAFKA-6593
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6593
>             Project: Kafka
>          Issue Type: Bug
>          Components: consumer
>    Affects Versions: 1.0.0, 0.11.0.2
>            Reporter: Jason Gustafson
>            Assignee: Jason Gustafson
>            Priority: Major
>             Fix For: 1.1.0
>
>         Attachments: consumer.log
>
>
> If a coordinator disconnect is observed in the heartbeat thread, it can cause
> a pending offset commit to be cancelled just before the foreground thread
> begins waiting on its response in poll(). Since the poll timeout is
> Long.MAX_VALUE, this will cause the consumer to effectively hang until some
> other network event causes the poll() to return. We try to protect this case
> with a poll condition on the future, but this isn't bulletproof since the
> future can be completed outside of the lock.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
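
For readers following the two-part fix described in the pull request, here is a minimal sketch of the general idea. It is not the actual patch and not Kafka's ConsumerNetworkClient: `SketchClient`, `enqueueCompletion`, and `firePendingCompletions` are hypothetical names chosen for illustration. It assumes (1) a fair `ReentrantLock` guards the I/O section, and (2) completions are handed over through a lock-free queue, so a thread reporting a completion never has to wait for the lock held by an active poll().

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical stand-in for a network client; NOT Kafka's ConsumerNetworkClient. */
class SketchClient {
    // Fair mode: the longest-waiting thread is favored when the lock is released,
    // which reduces the chance that one thread starves the other.
    private final ReentrantLock lock = new ReentrantLock(true);

    // Completions are queued without taking the lock, so they cannot be
    // blocked by another thread that is inside poll().
    private final ConcurrentLinkedQueue<Runnable> pendingCompletions =
            new ConcurrentLinkedQueue<>();

    boolean isFair() {
        return lock.isFair();
    }

    /** Callable from any thread (for example a heartbeat thread); never locks. */
    void enqueueCompletion(Runnable callback) {
        pendingCompletions.add(callback);
    }

    /** Simulated poll: I/O happens under the lock, callbacks fire outside it. */
    void poll() {
        lock.lock();
        try {
            // Network I/O would happen here in a real client.
        } finally {
            lock.unlock();
        }
        firePendingCompletions();
    }

    /** Drains the queue and runs each callback on the calling thread. */
    void firePendingCompletions() {
        Runnable callback;
        while ((callback = pendingCompletions.poll()) != null) {
            callback.run();
        }
    }
}
```

In this sketch the callbacks run on whichever thread drains the queue, so user-visible completion does not depend on winning the lock. Whether the real fix drains the queue at the same points is not stated in the message above.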