[ https://issues.apache.org/jira/browse/KAFKA-5152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16128797#comment-16128797 ]

ASF GitHub Bot commented on KAFKA-5152:
---------------------------------------

GitHub user dguy opened a pull request:

    https://github.com/apache/kafka/pull/3675

    KAFKA-5152: perform state restoration in poll loop

    

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dguy/kafka kafka-5152

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/kafka/pull/3675.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3675
    
----
commit 4da235927f9eebf2442533b44bf017454d372747
Author: Damian Guy <damian....@gmail.com>
Date:   2017-08-14T17:23:53Z

    perform state restoration in poll loop

----


> Kafka Streams keeps restoring state after shutdown is initiated during startup
> ------------------------------------------------------------------------------
>
>                 Key: KAFKA-5152
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5152
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 0.10.2.1
>            Reporter: Xavier Léauté
>            Assignee: Damian Guy
>            Priority: Blocker
>             Fix For: 0.11.0.1, 1.0.0
>
>
> If a streams shutdown is initiated during state restore (e.g. because an 
> uncaught exception is thrown), streams will not shut down until all stores 
> have finished restoring.
> As the restore progresses, stream threads appear to be taken out of service 
> as part of the shutdown sequence, causing tasks to be rebalanced. This 
> compounds the problem by slowing the restore down even further, since the 
> remaining threads now also have to restore the reassigned tasks before they 
> can shut down. (A sketch of a poll-loop shutdown check follows below.)
> A more severe issue is that if a new rebalance is triggered near the end of 
> the waitingSync phase (e.g. because a new member joins the group, or some 
> members time out on the SyncGroup response), some consumer clients in the 
> group may already have proceeded into {{onPartitionsAssigned}} and become 
> blocked trying to grab a state directory lock that has not yet been released 
> by other clients. Meanwhile, the clients holding the lock keep re-sending 
> {{JoinGroup}} requests, but the rebalance can never complete: the clients 
> blocked on the directory lock are never kicked out of the group, because 
> their heartbeat threads keep sending heartbeat requests. The result is a 
> deadlock caused by not releasing the state directory locks during task 
> suspension. (A sketch of releasing the lock on suspension also follows 
> below.)
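
The pull request title suggests the fix interleaves state restoration with
the poll loop, so that a shutdown request is observed between restore
batches rather than only after all stores have finished restoring. Below is
a minimal sketch of that idea; the class and helper names
(RestoreLoopSketch, restoreNextBatch, etc.) are hypothetical stand-ins, not
the actual Kafka Streams internals:

    import java.util.concurrent.atomic.AtomicBoolean;

    // Sketch: restoration interleaved with the poll loop so a shutdown
    // request is observed between restore batches instead of only after
    // all stores have finished restoring.
    public class RestoreLoopSketch {
        private final AtomicBoolean running = new AtomicBoolean(true);

        // Called from the shutdown path (e.g. an uncaught-exception handler).
        public void shutdown() {
            running.set(false);
        }

        public void runLoop() {
            while (running.get()) {
                // Restore one batch per iteration rather than looping to
                // completion, so the running flag is re-checked promptly.
                if (!allStoresRestored()) {
                    restoreNextBatch();
                } else {
                    pollAndProcess();
                }
            }
            // The flag was observed between batches, so shutdown proceeds
            // without waiting for the remaining stores to finish restoring.
            closeTasksAndReleaseLocks();
        }

        // Hypothetical placeholders for the real restore/poll machinery.
        private boolean allStoresRestored() { return false; }
        private void restoreNextBatch() { }
        private void pollAndProcess() { }
        private void closeTasksAndReleaseLocks() { }
    }

The design point is simply that the running flag is re-checked once per
batch, which bounds how long a shutdown can be delayed by restoration.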
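
The deadlock described in the report hinges on the state directory lock
being held across a task suspension. Here is a minimal sketch of the hazard
and the fix using java.nio file locks; the class name and lock-file layout
are assumptions for illustration, not Kafka's actual StateDirectory
implementation:

    import java.io.IOException;
    import java.nio.channels.FileChannel;
    import java.nio.channels.FileLock;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    // Sketch of the locking hazard: if suspend() keeps the state directory
    // lock, another client's onPartitionsAssigned() blocks in acquire()
    // while this client keeps re-joining the group -- the deadlock described
    // above. Releasing the lock on suspension breaks the cycle.
    public class StateDirLockSketch implements AutoCloseable {
        private final FileChannel channel;
        private FileLock lock;

        public StateDirLockSketch(Path stateDir) throws IOException {
            // Hypothetical lock file name; the real directory layout differs.
            channel = FileChannel.open(stateDir.resolve(".lock"),
                    StandardOpenOption.CREATE, StandardOpenOption.WRITE);
        }

        public void acquire() throws IOException {
            lock = channel.lock(); // blocks until the current holder releases
        }

        // The key point: task suspension must release the lock so that a
        // rebalanced assignment on another thread or client can acquire it.
        public void suspend() throws IOException {
            if (lock != null) {
                lock.release();
                lock = null;
            }
        }

        @Override
        public void close() throws IOException {
            suspend();
            channel.close();
        }
    }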



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
