[
https://issues.apache.org/jira/browse/KAFKA-5395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16040511#comment-16040511
]
Rajini Sivaram edited comment on KAFKA-5395 at 6/7/17 8:40 AM:
---------------------------------------------------------------
This was fixed in trunk in the commit:
https://github.com/apache/kafka/commit/43524442dc10c5dc731248674eb1a811287e88f7.
I will add the fix to the 0.10.2 branch.
was (Author: rsivaram):
This was fixed in trunk, I will add the fix to 0.10.2 branch.
> Distributed Herder Deadlocks on Shutdown
> ----------------------------------------
>
> Key: KAFKA-5395
> URL: https://issues.apache.org/jira/browse/KAFKA-5395
> Project: Kafka
> Issue Type: Bug
> Components: KafkaConnect
> Affects Versions: 0.10.2.1
> Reporter: Michael Jaschob
> Assignee: Rajini Sivaram
> Priority: Critical
> Fix For: 0.11.0.0
>
> Attachments: connect_01021_shutdown_deadlock.txt
>
>
> We're trying to upgrade Kafka Connect to 0.10.2.1 and see that the process
> does not shut down cleanly. It hangs instead. From what I can tell
> [KAFKA-4786|https://github.com/apache/kafka/commit/ba4eafa7874988374abcd9f48fbab96abb2032a4]
> introduced this deadlock.
> [close|https://github.com/apache/kafka/blob/0.10.2.1/clients/src/main/java/org/apache/kafka/clients/consumer/internals/AbstractCoordinator.java#L664]
> on the AbstractCoordinator is marked as synchronized and acquires the
> coordinator's monitor. The first thing it tries to do is
> [join|https://github.com/apache/kafka/blob/0.10.2.1/clients/src/main/java/org/apache/kafka/clients/consumer/internals/AbstractCoordinator.java#L323]
> the heartbeat thread.
> Meanwhile, the heartbeat thread is [synchronized on the same
> monitor|https://github.com/apache/kafka/blob/0.10.2.1/clients/src/main/java/org/apache/kafka/clients/consumer/internals/AbstractCoordinator.java#L891],
> which it relinquishes when it
> [waits|https://github.com/apache/kafka/blob/0.10.2.1/clients/src/main/java/org/apache/kafka/clients/consumer/internals/AbstractCoordinator.java#L926].
> But for the wait to return (and the run method of the heartbeat to
> terminate) it needs to reacquire that monitor.
> There's no way for the heartbeat thread to reacquire the monitor since it is
> held by the distributed herder thread. And the distributed herder will never
> relinquish the monitor since it is waiting for the heartbeat thread to join.
> I am attaching a thread dump illustrating the situation. Take note in
> particular of threads #178 (the heartbeat thread) and #159 (the herder
> thread). The former is BLOCKED trying to reacquire 0x00000007406cc0c0, and
> the latter is WAITING on the heartbeat thread to join, having itself acquired
> 0x00000007406cc0c0.
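
The lock ordering described in the quoted report can be reproduced in isolation. The following is only a minimal sketch, not Kafka's AbstractCoordinator: the class CoordinatorSketch and the methods closeHoldingMonitor and closeReleasingMonitor are hypothetical names made up for illustration, and the join uses a timeout so the demonstration terminates instead of hanging. The second variant shows one common way to avoid the problem (signal under the monitor, join after releasing it); it is not claimed to be what the fix commit does.

```java
// Illustrative only: a stand-in for a coordinator with a heartbeat thread.
final class CoordinatorSketch {
    private boolean closed = false; // guarded by this object's monitor
    private final Thread heartbeat = new Thread(this::runHeartbeat, "heartbeat-sketch");

    CoordinatorSketch() {
        heartbeat.setDaemon(true); // never keeps the JVM alive
        heartbeat.start();
    }

    // The heartbeat loop synchronizes on the coordinator and waits on it.
    private void runHeartbeat() {
        synchronized (this) {
            while (!closed) {
                try {
                    // wait() releases the monitor, but must reacquire it before returning.
                    wait(50);
                } catch (InterruptedException e) {
                    return;
                }
            }
        }
    }

    // Problematic pattern: join the heartbeat thread while still holding the monitor.
    // Returns true if the heartbeat thread terminated within the timeout.
    synchronized boolean closeHoldingMonitor(long timeoutMs) {
        closed = true;
        notifyAll();
        return joinHeartbeat(timeoutMs);
    }

    // Alternative pattern: signal under the monitor, then join after releasing it.
    boolean closeReleasingMonitor(long timeoutMs) {
        synchronized (this) {
            closed = true;
            notifyAll();
        }
        return joinHeartbeat(timeoutMs);
    }

    private boolean joinHeartbeat(long timeoutMs) {
        if (timeoutMs <= 0) {
            // join(0) waits forever, which would turn the first variant into a real hang.
            throw new IllegalArgumentException("timeoutMs must be positive");
        }
        try {
            heartbeat.join(timeoutMs);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !heartbeat.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("joined while holding monitor: "
                + new CoordinatorSketch().closeHoldingMonitor(200));
        System.out.println("joined after releasing monitor: "
                + new CoordinatorSketch().closeReleasingMonitor(2000));
    }
}
```

In the first variant the heartbeat thread cannot finish until the closing thread leaves the synchronized method, so the timed join is expected to give up with the thread still alive; with an untimed join, as in the report, neither thread would make progress. In the second variant the heartbeat thread can reacquire the monitor, observe the closed flag, and exit.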