[jira] [Comment Edited] (KAFKA-5395) Distributed Herder Deadlocks on Shutdown

Rajini Sivaram (JIRA) Wed, 07 Jun 2017 01:40:51 -0700

    [ https://issues.apache.org/jira/browse/KAFKA-5395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16040511#comment-16040511 ]


Rajini Sivaram edited comment on KAFKA-5395 at 6/7/17 8:40 AM:
---------------------------------------------------------------

This was fixed in trunk by commit 
https://github.com/apache/kafka/commit/43524442dc10c5dc731248674eb1a811287e88f7.
 I will backport the fix to the 0.10.2 branch.


was (Author: rsivaram):
This was fixed in trunk, I will add the fix to 0.10.2 branch.

> Distributed Herder Deadlocks on Shutdown
> ----------------------------------------
>
>                 Key: KAFKA-5395
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5395
>             Project: Kafka
>          Issue Type: Bug
>          Components: KafkaConnect
>    Affects Versions: 0.10.2.1
>            Reporter: Michael Jaschob
>            Assignee: Rajini Sivaram
>            Priority: Critical
>             Fix For: 0.11.0.0
>
>         Attachments: connect_01021_shutdown_deadlock.txt
>
>
> We're trying to upgrade Kafka Connect to 0.10.2.1 and see that the process 
> does not shut down cleanly; it hangs instead. From what I can tell, 
> [KAFKA-4786|https://github.com/apache/kafka/commit/ba4eafa7874988374abcd9f48fbab96abb2032a4]
>  introduced this deadlock.
> The 
> [close|https://github.com/apache/kafka/blob/0.10.2.1/clients/src/main/java/org/apache/kafka/clients/consumer/internals/AbstractCoordinator.java#L664]
>  method on AbstractCoordinator is marked as synchronized and acquires the 
> coordinator's monitor. The first thing it tries to do is 
> [join|https://github.com/apache/kafka/blob/0.10.2.1/clients/src/main/java/org/apache/kafka/clients/consumer/internals/AbstractCoordinator.java#L323]
>  the heartbeat thread.
> Meanwhile, the heartbeat thread is [synchronized on the same 
> monitor|https://github.com/apache/kafka/blob/0.10.2.1/clients/src/main/java/org/apache/kafka/clients/consumer/internals/AbstractCoordinator.java#L891],
>  which it relinquishes when it 
> [waits|https://github.com/apache/kafka/blob/0.10.2.1/clients/src/main/java/org/apache/kafka/clients/consumer/internals/AbstractCoordinator.java#L926].
>  But for the wait to return (and the run method of the heartbeat to 
> terminate) it needs to reacquire that monitor.
> There's no way for the heartbeat thread to reacquire the monitor since it is 
> held by the distributed herder thread. And the distributed herder will never 
> relinquish the monitor since it is waiting for the heartbeat thread to join.
> I am attaching a thread dump illustrating the situation. Take note in 
> particular of threads #178 (the heartbeat thread) and #159 (the herder 
> thread). The former is BLOCKED trying to reacquire 0x00000007406cc0c0, and 
> the latter is WAITING on the heartbeat thread to join, having itself acquired 
> 0x00000007406cc0c0.
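
The lock ordering described above can be reproduced in isolation. The following is a minimal sketch, not the Kafka code and not necessarily what the fix commit does: the class and method names (ShutdownDeadlockSketch, Coordinator, closeHoldingMonitor, closeReleasingMonitor) are hypothetical, and the join uses a timeout so that the demonstration returns instead of hanging the way the real process does.

```java
class ShutdownDeadlockSketch {

    /** Hypothetical stand-in for a coordinator that owns a heartbeat thread. */
    static class Coordinator {
        private boolean closed = false;
        private final Thread heartbeat = new Thread(this::heartbeatLoop, "heartbeat");

        void start() {
            heartbeat.start();
        }

        private void heartbeatLoop() {
            synchronized (this) {
                while (!closed) {
                    try {
                        // wait() releases the monitor, but it cannot return
                        // until this thread has reacquired the monitor.
                        wait();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
            }
        }

        /**
         * Problematic pattern: joins the heartbeat thread while still holding
         * the monitor that the heartbeat thread needs in order to exit.
         * Returns true if the heartbeat thread terminated within the timeout.
         */
        synchronized boolean closeHoldingMonitor(long joinTimeoutMs) {
            closed = true;
            notifyAll();
            return joinHeartbeat(joinTimeoutMs);
        }

        /**
         * One possible remedy (an assumption for illustration): signal under
         * the monitor, then join only after the monitor has been released.
         */
        boolean closeReleasingMonitor(long joinTimeoutMs) {
            synchronized (this) {
                closed = true;
                notifyAll();
            }
            return joinHeartbeat(joinTimeoutMs);
        }

        private boolean joinHeartbeat(long joinTimeoutMs) {
            try {
                heartbeat.join(joinTimeoutMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return !heartbeat.isAlive();
        }
    }

    public static void main(String[] args) {
        Coordinator stuck = new Coordinator();
        stuck.start();
        System.out.println("join while holding monitor, heartbeat stopped: "
                + stuck.closeHoldingMonitor(500));

        Coordinator clean = new Coordinator();
        clean.start();
        System.out.println("join after releasing monitor, heartbeat stopped: "
                + clean.closeReleasingMonitor(5000));
    }
}
```

With an untimed join, the first variant would block forever, which matches the BLOCKED and WAITING states described for the heartbeat and herder threads. In the second variant the heartbeat thread can reacquire the monitor, observe the closed flag, and exit before the join completes.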



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
