[ 
https://issues.apache.org/jira/browse/KAFKA-9961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dhruvil Shah updated KAFKA-9961:
--------------------------------
    Description: 
When completing a reassignment, the controller sends StopReplicaRequest to 
replicas that are not in the target assignment and removes them from the 
assignment in ZK. We do not have any retry mechanism to ensure that the broker 
is able to process the StopReplicaRequest successfully. Under certain 
circumstances, this could leave brokers in an inconsistent state, where they 
continue being the follower for this partition and end up with an inconsistent 
metadata cache.

We have seen messages like the following being spammed in the broker logs when 
we get into this situation:
{code:java}
While recording the replica LEO, the partition topic-1 hasn't been created.
{code}
This happens because the broker has neither received an updated 
LeaderAndIsrRequest for the new leader nor a StopReplicaRequest from the 
controller when the replica was removed from the assignment.

Note that we would require a restart of the affected broker to fix this 
situation. A controller failover would not fix it as the broker could continue 
being a replica for the partition until it receives a StopReplicaRequest, which 
would never happen in this case.

There seem to be a couple of problems we should address:
 # We need a mechanism to retry replica deletions after partition reassignment 
is complete. The main challenge here is to be able to deal with cases where a 
broker has been decommissioned and may never come back up.
 # We could perhaps consider a mechanism to reconcile replica states across 
brokers, something similar to the solution proposed in 
[https://cwiki.apache.org/confluence/display/KAFKA/KIP-550%3A+Mechanism+to+Delete+Stray+Partitions+on+Broker].

  was:
When completing a reassignment, the controller sends StopReplicaRequest to 
replicas that are not in the target assignment and removes them from the 
assignment in ZK. We do not have any retry mechanism to ensure that the broker 
is able to process the StopReplicaRequest successfully. Under certain 
circumstances, this could leave brokers in an inconsistent state, where they 
continue being the follower for this partition and end up with an inconsistent 
metadata cache.

We have seen messages like the following being spammed in the broker logs when 
we get into this situation:
{code:java}
While recording the replica LEO, the partition topic-1 hasn't been created.
{code}
This happens because the broker has not an updated LeaderAndIsrRequest for the 
new leader nor a StopReplicaRequest from the controller when the replica was 
removed from the assignment.

Note that we would require a restart of the affected broker to fix this 
situation. A controller failover would not fix it as the broker could continue 
being a replica for the partition until it receives a StopReplicaRequest, which 
would never happen in this case.

There seem to be couple of problems we should address:
 # We need a mechanism to retry replica deletions after partition reassignment 
is complete. The main challenge here is to be able to deal with cases where a 
broker has been decommissioned and may never come back up.
 # We could perhaps consider a mechanism to reconcile replica states across 
brokers, something similar to the solution proposed in 
[https://cwiki.apache.org/confluence/display/KAFKA/KIP-550%3A+Mechanism+to+Delete+Stray+Partitions+on+Broker].


> Brokers may be left in an inconsistent state after reassignment
> ---------------------------------------------------------------
>
>                 Key: KAFKA-9961
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9961
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Dhruvil Shah
>            Priority: Major
>
> When completing a reassignment, the controller sends StopReplicaRequest to 
> replicas that are not in the target assignment and removes them from the 
> assignment in ZK. We do not have any retry mechanism to ensure that the 
> broker is able to process the StopReplicaRequest successfully. Under certain 
> circumstances, this could leave brokers in an inconsistent state, where they 
> continue being the follower for this partition and end up with an 
> inconsistent metadata cache.
> We have seen messages like the following being spammed in the broker logs 
> when we get into this situation:
> {code:java}
> While recording the replica LEO, the partition topic-1 hasn't been created.
> {code}
> This happens because the broker has neither received an updated 
> LeaderAndIsrRequest for the new leader nor a StopReplicaRequest from the 
> controller when the replica was removed from the assignment.
> Note that we would require a restart of the affected broker to fix this 
> situation. A controller failover would not fix it as the broker could 
> continue being a replica for the partition until it receives a 
> StopReplicaRequest, which would never happen in this case.
> There seem to be a couple of problems we should address:
>  # We need a mechanism to retry replica deletions after partition 
> reassignment is complete. The main challenge here is to be able to deal with 
> cases where a broker has been decommissioned and may never come back up.
>  # We could perhaps consider a mechanism to reconcile replica states across 
> brokers, something similar to the solution proposed in 
> [https://cwiki.apache.org/confluence/display/KAFKA/KIP-550%3A+Mechanism+to+Delete+Stray+Partitions+on+Broker].
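
To make the first item in the list above concrete, here is a minimal sketch of a bounded retry queue for replica deletions. It is not Kafka controller code: the class StopReplicaRetrier, its methods, the maxAttempts limit, and the sender callback are all hypothetical names introduced for illustration. The sketch assumes the controller can tell which brokers are currently live and that a send either is acknowledged or is not. Capping the number of attempts is one possible way to deal with a decommissioned broker that never comes back; the abandoned entries would still need some other form of cleanup.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Set;
import java.util.function.BiPredicate;

/** Hypothetical sketch of a retry queue for replica deletions; not part of Kafka. */
final class StopReplicaRetrier {

    /** One pending deletion: the broker that must drop its replica of a partition. */
    static final class Pending {
        final int brokerId;
        final String topicPartition;
        int attempts = 0;

        Pending(int brokerId, String topicPartition) {
            this.brokerId = brokerId;
            this.topicPartition = topicPartition;
        }
    }

    private final Deque<Pending> queue = new ArrayDeque<>();
    private final int maxAttempts;
    // Assumed callback: sends a stop-replica request, returns true if the broker acknowledged it.
    private final BiPredicate<Integer, String> sender;

    StopReplicaRetrier(int maxAttempts, BiPredicate<Integer, String> sender) {
        this.maxAttempts = maxAttempts;
        this.sender = sender;
    }

    /** Record a deletion that must eventually be acknowledged by the broker. */
    void enqueue(int brokerId, String topicPartition) {
        queue.add(new Pending(brokerId, topicPartition));
    }

    /**
     * Runs one retry pass over everything currently queued.
     * Returns the deletions abandoned because they reached maxAttempts,
     * for example because the broker was decommissioned.
     */
    List<Pending> runOnce(Set<Integer> liveBrokers) {
        List<Pending> abandoned = new ArrayList<>();
        int pendingAtStart = queue.size();
        for (int i = 0; i < pendingAtStart; i++) {
            Pending p = queue.poll();
            // A pass where the broker is offline also counts as an attempt,
            // so an entry for a dead broker cannot stay queued forever.
            p.attempts++;
            boolean acked = liveBrokers.contains(p.brokerId)
                    && sender.test(p.brokerId, p.topicPartition);
            if (acked) {
                continue; // deletion confirmed, nothing left to retry
            }
            if (p.attempts >= maxAttempts) {
                abandoned.add(p);
            } else {
                queue.add(p); // try again on the next pass
            }
        }
        return abandoned;
    }

    int pendingCount() {
        return queue.size();
    }
}
```

A caller would invoke runOnce periodically with the current set of live brokers and decide what to do with the abandoned entries, such as logging them or handing them to a reconciliation step like the one described in the second item.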



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
