[ 
https://issues.apache.org/jira/browse/KAFKA-9484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson updated KAFKA-9484:
-----------------------------------
    Description: 
Following the completion of the reassignment, the controller executes two 
steps: first, it elects a new leader (if needed) and sends a LeaderAndIsr 
update (in any case) with the new target replica set; second, it removes 
unneeded replicas from the replica set and sends another round of LeaderAndIsr 
updates. I doubt that the first round of updates is needed when the leader 
does not need to change. 

For example, suppose we have the following reassignment state: 

replicas=[1,2,3,4], adding=[4], removing=[1], isr=[1,2,3,4], leader=2, epoch=10

First, the controller will bump the epoch with the target replica set, which 
will result in a round of LeaderAndIsr requests to the target replica set with 
the following state: 

replicas=[2,3,4], adding=[], removing=[], isr=[1,2,3,4], leader=2, epoch=11 

Immediately following this, the controller will bump the epoch again and remove 
the unneeded replica. This will result in another round of LeaderAndIsr 
requests with the following state: 

replicas=[2,3,4], adding=[], removing=[], isr=[2,3,4], leader=2, epoch=12 
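
The two transitions above can be traced with a small model. This is only an 
illustrative sketch, not the controller's code: the PartitionState class and 
the functions shrink_replica_set and shrink_isr are hypothetical names, and 
the logic assumes nothing beyond what the example states show. 

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class PartitionState:
    """Hypothetical model of the fields shown in the example states."""
    replicas: Tuple[int, ...]
    adding: Tuple[int, ...]
    removing: Tuple[int, ...]
    isr: Tuple[int, ...]
    leader: int
    epoch: int


def shrink_replica_set(state: PartitionState) -> PartitionState:
    """First round: switch to the target replica set and bump the epoch.

    The ISR is left untouched, so it may still contain a removed replica.
    """
    target = tuple(r for r in state.replicas if r not in state.removing)
    return replace(state, replicas=target, adding=(), removing=(),
                   epoch=state.epoch + 1)


def shrink_isr(state: PartitionState) -> PartitionState:
    """Second round: drop ISR members outside the replica set, bump the epoch."""
    isr = tuple(r for r in state.isr if r in state.replicas)
    return replace(state, isr=isr, epoch=state.epoch + 1)


start = PartitionState(replicas=(1, 2, 3, 4), adding=(4,), removing=(1,),
                       isr=(1, 2, 3, 4), leader=2, epoch=10)
after_first = shrink_replica_set(start)
after_second = shrink_isr(after_first)
print(after_first)
print(after_second)
```

In this model, each round costs one epoch bump and one round of requests, and 
the leader is the same before and after both rounds. 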

The first round of LeaderAndIsr updates puzzles me a bit. It is justified in 
the code with this comment: 

{code} 
B3. Send a LeaderAndIsr request with RS = TRS. This will prevent the leader 
from adding any replica in TRS - ORS back in the isr. 
{code} 
(I think the comment is backwards. It should be ORS (original replica set) - 
TRS (target replica set).) 

It sounds like we are trying to prevent a member of ORS from being added back 
to the ISR, but even if it did get added, it would be removed in the next step 
anyway. In the uncommon case that an ORS replica is out of sync, there does not 
seem to be any benefit to this first update since it is basically paying the 
cost of one write in order to save the speculative cost of one write. 
Additionally, it would be useful if the protocol could enforce the invariant 
that the ISR is always a subset of the replica set.
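
As a rough illustration of that invariant, the check below flags any state in 
which the ISR contains a broker outside the replica set. The function name 
isr_within_replicas is hypothetical, and the three states are copied from the 
example above; this is a sketch of the check, not an existing Kafka API. 

```python
def isr_within_replicas(replicas, isr):
    """Return True when every ISR member is also in the replica set."""
    return set(isr) <= set(replicas)


# (epoch, replicas, isr) for the three states in the example above.
states = [
    (10, [1, 2, 3, 4], [1, 2, 3, 4]),
    (11, [2, 3, 4], [1, 2, 3, 4]),
    (12, [2, 3, 4], [2, 3, 4]),
]

# Collect the epochs at which the invariant does not hold.
violations = [epoch for epoch, replicas, isr in states
              if not isr_within_replicas(replicas, isr)]
print(violations)
```

Only the intermediate state produced by the first round of updates fails the 
check. 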

  was:
Following the completion of the reassignment, the controller executes two 
steps: first, it elects a new leader (if needed) and sends a LeaderAndIsr 
update (in any case) with the new target replica set; second, it removes 
unneeded replicas from the replica set and sends another round of LeaderAndIsr 
updates. I doubt that the first round of updates is needed when the leader 
does not need to change. 

For example, suppose we have the following reassignment state: 

replicas=[1,2,3,4], adding=[4], removing=[1], isr=[1,2,3,4], leader=2, epoch=10

First, the controller will bump the epoch with the target replica set, which 
will result in a round of LeaderAndIsr requests to the target replica set with 
the following state: 

replicas=[2,3,4], adding=[], removing=[], isr=[1,2,3,4], leader=2, epoch=11 

Immediately following this, the controller will bump the epoch again and remove 
the unneeded replica. This will result in another round of LeaderAndIsr 
requests with the following state: 

replicas=[2,3,4], adding=[], removing=[], isr=[1,2,3], leader=2, epoch=12 

The first round of LeaderAndIsr updates puzzles me a bit. It is justified in 
the code with this comment: 

{code} 
B3. Send a LeaderAndIsr request with RS = TRS. This will prevent the leader 
from adding any replica in TRS - ORS back in the isr. 
{code} 
(I think the comment is backwards. It should be ORS (original replica set) - 
TRS (target replica set).) 

It sounds like we are trying to prevent a member of ORS from being added back 
to the ISR, but even if it did get added, it would be removed in the next step 
anyway. In the uncommon case that an ORS replica is out of sync, there does not 
seem to be any benefit to this first update since it is basically paying the 
cost of one write in order to save the speculative cost of one write. 
Additionally, it would be useful if the protocol could enforce the invariant 
that the ISR is always a subset of the replica set.


> Unnecessary LeaderAndIsr update following reassignment completion
> -----------------------------------------------------------------
>
>                 Key: KAFKA-9484
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9484
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Jason Gustafson
>            Assignee: Jason Gustafson
>            Priority: Major
>
> Following the completion of the reassignment, the controller executes two 
> steps: first, it elects a new leader (if needed) and sends a LeaderAndIsr 
> update (in any case) with the new target replica set; second, it removes 
> unneeded replicas from the replica set and sends another round of 
> LeaderAndIsr updates. I doubt that the first round of updates is needed 
> when the leader does not need to change. 
> For example, suppose we have the following reassignment state: 
> replicas=[1,2,3,4], adding=[4], removing=[1], isr=[1,2,3,4], leader=2, epoch=10
> First, the controller will bump the epoch with the target replica set, which 
> will result in a round of LeaderAndIsr requests to the target replica set 
> with the following state: 
> replicas=[2,3,4], adding=[], removing=[], isr=[1,2,3,4], leader=2, epoch=11 
> Immediately following this, the controller will bump the epoch again and 
> remove the unneeded replica. This will result in another round of 
> LeaderAndIsr requests with the following state: 
> replicas=[2,3,4], adding=[], removing=[], isr=[2,3,4], leader=2, epoch=12 
> The first round of LeaderAndIsr updates puzzles me a bit. It is justified in 
> the code with this comment: 
> {code} 
> B3. Send a LeaderAndIsr request with RS = TRS. This will prevent the leader 
> from adding any replica in TRS - ORS back in the isr. 
> {code} 
> (I think the comment is backwards. It should be ORS (original replica set) - 
> TRS (target replica set).) 
> It sounds like we are trying to prevent a member of ORS from being added back 
> to the ISR, but even if it did get added, it would be removed in the next 
> step anyway. In the uncommon case that an ORS replica is out of sync, there 
> does not seem to be any benefit to this first update since it is basically 
> paying the cost of one write in order to save the speculative cost of one 
> write. Additionally, it would be useful if the protocol could enforce the 
> invariant that the ISR is always a subset of the replica set.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
