ASF GitHub Bot commented on KAFKA-5152:

GitHub user dguy opened a pull request:


    KAFKA-5152: move state restoration out of rebalance and into poll loop

    In `onPartitionsAssigned`:
    1. Release all locks for non-assigned suspended tasks.
    2. Resume any suspended tasks.
    3. Create new tasks, but don't attempt to take the state lock.
    4. Pause partitions for any new tasks.
    5. Set the state to `PARTITIONS_ASSIGNED`.
    In `StreamThread#runLoop`:
    1. Poll.
    2. If the state is `PARTITIONS_ASSIGNED`:
     2.1 Attempt to initialize any new tasks, i.e., take out the state locks
         and init state stores.
     2.2 Restore some data for changelogs, i.e., poll once on the restore
         consumer and return the partitions that have been fully restored.
     2.3 Update tasks with restored partitions and move any that have completed
         restoration to running.
     2.4 Resume consumption for any tasks where all partitions have been
         restored.
     2.5 If all active tasks are running, transition to `RUNNING` and assign
         standby partitions to the restoreConsumer.
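
The steps above can be sketched as a small state machine. This is a minimal
illustration, not the code in the pull request: the class `RestoreOnPollSketch`,
its `Task` type, the `remaining` counter and the `restoreBatch` parameter are
all hypothetical stand-ins, and the real poll, state lock and restore consumer
are omitted. The point it tries to show is that restoration advances one
bounded step per loop iteration, so a shutdown request is seen between steps.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: every class, field, and method here is hypothetical
// and is not the Kafka Streams implementation.
class RestoreOnPollSketch {
    enum State { PARTITIONS_ASSIGNED, RUNNING }

    static final class Task {
        final String id;
        int remaining;          // changelog records still to restore (stand-in)
        boolean initialized;    // stand-in for "state lock taken, stores initialized"
        boolean paused = true;  // new tasks start with their partitions paused

        Task(String id, int remaining) {
            this.id = id;
            this.remaining = remaining;
        }
    }

    private final List<Task> tasks = new ArrayList<>();
    private State state = State.RUNNING;
    private volatile boolean shutdownRequested;

    State state() { return state; }

    void requestShutdown() { shutdownRequested = true; }

    // Rebalance callback: registers the tasks and flips the state.
    // It does not take locks and does not restore anything.
    void onPartitionsAssigned(List<Task> newTasks) {
        tasks.clear();
        tasks.addAll(newTasks);
        state = State.PARTITIONS_ASSIGNED;
    }

    // One loop iteration. Step 1 (the regular poll) is omitted here.
    void runOnce(int restoreBatch) {
        if (state != State.PARTITIONS_ASSIGNED) {
            return;
        }
        boolean allRunning = true;
        for (Task task : tasks) {
            task.initialized = true;                                      // 2.1
            task.remaining = Math.max(0, task.remaining - restoreBatch);  // 2.2
            if (task.remaining == 0) {
                task.paused = false;                                      // 2.3, 2.4
            } else {
                allRunning = false;
            }
        }
        if (allRunning) {
            state = State.RUNNING;                                        // 2.5
        }
    }

    // Loops until restoration completes, shutdown is requested, or the
    // iteration cap is hit. Returns the number of iterations executed.
    int runLoop(int restoreBatch, int maxIterations) {
        int iterations = 0;
        while (!shutdownRequested && state != State.RUNNING && iterations < maxIterations) {
            runOnce(restoreBatch);
            iterations++;
        }
        return iterations;
    }

    public static void main(String[] args) {
        RestoreOnPollSketch thread = new RestoreOnPollSketch();
        thread.onPartitionsAssigned(Arrays.asList(new Task("0_0", 25), new Task("0_1", 5)));
        int iterations = thread.runLoop(10, 100);
        System.out.println(iterations + " iterations, state=" + thread.state());
    }
}
```

In this sketch, calling `requestShutdown()` before or during `runLoop` ends the
loop at the next iteration boundary instead of waiting for every store to
finish restoring.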

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dguy/kafka 0.11.0-restore-on-poll

Alternatively you can review and apply these changes as the patch at:


To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3653

commit 27016b9e9706ee95bcedd9a1408c71e62a0f178e
Author: Damian Guy <damian....@gmail.com>
Date:   2017-08-09T19:02:17Z

    restore state on the poll loop


> Kafka Streams keeps restoring state after shutdown is initiated during startup
> ------------------------------------------------------------------------------
>                 Key: KAFKA-5152
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5152
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions:
>            Reporter: Xavier Léauté
>            Assignee: Damian Guy
>            Priority: Blocker
>             Fix For:,
> If streams shutdown is initiated during state restore (e.g. an uncaught 
> exception is thrown), streams will not shut down until all stores have 
> finished restoring.
> As restore progresses, stream threads appear to be taken out of service as 
> part of the shutdown sequence, causing rebalancing of tasks. This compounds 
> the problem by slowing down the restore process even further, since the 
> remaining threads now have to also restore the reassigned tasks before they 
> can shut down.
> A more severe issue is that if a new rebalance is triggered during the 
> end of the waitingSync phase (e.g. due to a new member joining the group, or 
> some members timing out on the SyncGroup response), then some consumer 
> clients of the group may already have proceeded to {{onPartitionsAssigned}} 
> and be blocked trying to grab the file dir lock not yet released by other 
> clients, while the other clients holding the lock keep re-sending 
> {{JoinGroup}} requests. The rebalance cannot complete because the clients 
> blocked on the file dir lock will not be kicked out of the group, as their 
> heartbeat threads keep sending HBRequest. Hence this is a deadlock caused 
> by not releasing the file dir locks in task suspension.
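
To illustrate the locking side of the deadlock described above, here is a
minimal in-memory sketch. It is an assumption-laden stand-in, not the Kafka
Streams state directory code: `StateDirLockSketch`, `tryLock` and
`releaseUnassigned` are hypothetical names, and a `ConcurrentHashMap` replaces
the real file dir lock. It shows the two ideas from the pull request
description: lock acquisition that returns immediately instead of blocking,
and release of locks for tasks that are no longer assigned.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch only: an in-memory stand-in for per-task directory locks.
// All names are hypothetical and not part of Kafka Streams.
class StateDirLockSketch {
    // Maps a task id to the name of the thread currently holding its lock.
    private final Map<String, String> owners = new ConcurrentHashMap<>();

    // Non-blocking: returns false right away if another owner holds the lock,
    // so the caller can retry on a later loop iteration.
    boolean tryLock(String taskId, String threadName) {
        String current = owners.putIfAbsent(taskId, threadName);
        return current == null || current.equals(threadName);
    }

    // Releases every lock held by threadName for a task that is not in the
    // newly assigned set.
    void releaseUnassigned(String threadName, Set<String> assigned) {
        owners.entrySet().removeIf(
            e -> e.getValue().equals(threadName) && !assigned.contains(e.getKey()));
    }

    public static void main(String[] args) {
        StateDirLockSketch locks = new StateDirLockSketch();
        System.out.println(locks.tryLock("0_0", "thread-A"));
        System.out.println(locks.tryLock("0_0", "thread-B"));
        locks.releaseUnassigned("thread-A", Collections.<String>emptySet());
        System.out.println(locks.tryLock("0_0", "thread-B"));
    }
}
```

Because `tryLock` never waits, a thread that loses the race stays free to take
part in the next rebalance rather than sitting inside the rebalance callback.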

This message was sent by Atlassian JIRA