Hi Tian,
Thanks for reporting this.
For this case, the recommended approach is to replace the failed peer directly
in a single reconfiguration, i.e. change the configuration from {R1, R2, R3} to
{R2, R3, R4}, instead of first adding R4 and then removing R1.
The reason is that Ratis uses joint consensus for membership change. During the
transitional phase, agreement requires separate majorities from both the old
and the new configurations. So if you first change {R1, R2, R3} to {R1, R2, R3,
R4} while R1 is already unavailable, the new configuration has four members and
needs three of them for a majority. With R1 down, that means R2, R3, and R4 must
all be available and caught up, which makes that path more fragile.
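To make the majority arithmetic concrete, here is a small self-contained Java
sketch. It does not use any Ratis classes; the class and method names
(JointConsensusSketch, hasMajority, jointQuorum) are made up for illustration,
and it only models the "separate majorities" rule, not the staging or catch-up
logic in the server.

```java
import java.util.List;
import java.util.Set;

// Illustrative only: models the joint-consensus quorum rule, not Ratis internals.
public class JointConsensusSketch {

    // True if strictly more than half of the peers in conf are available.
    static boolean hasMajority(List<String> conf, Set<String> available) {
        long up = conf.stream().filter(available::contains).count();
        return up > conf.size() / 2;
    }

    // During the transitional phase, the old and the new configuration
    // must each reach a majority on their own.
    static boolean jointQuorum(List<String> oldConf, List<String> newConf,
                               Set<String> available) {
        return hasMajority(oldConf, available) && hasMajority(newConf, available);
    }

    public static void main(String[] args) {
        List<String> oldConf = List.of("R1", "R2", "R3");
        List<String> addFirst = List.of("R1", "R2", "R3", "R4");
        List<String> replace = List.of("R2", "R3", "R4");

        // R1 is down in both scenarios.
        Set<String> r4Ready = Set.of("R2", "R3", "R4");
        Set<String> r4NotReady = Set.of("R2", "R3");

        // Adding R4 first: the new side needs 3 of 4, so R2, R3, R4 are all required.
        System.out.println("add-first, R4 ready:     "
            + jointQuorum(oldConf, addFirst, r4Ready));
        System.out.println("add-first, R4 not ready: "
            + jointQuorum(oldConf, addFirst, r4NotReady));

        // Direct replacement: the new side needs 2 of 3, so R2 and R3 are enough.
        System.out.println("replace,   R4 not ready: "
            + jointQuorum(oldConf, replace, r4NotReady));
    }
}
```

In this model, the add-first path loses its quorum as soon as any one of R2, R3,
or R4 is slow or unreachable, while the direct replacement still has a majority
on both sides with only R2 and R3.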
That said, adding R4 while R1 is down is still intended to be supported by
design. The staging logic does not wait for all existing followers to catch up.
It only waits for the bootstrapping peers that are not yet marked as caught up.
Also, the leader explicitly adds the new peer to the RPC peer set before
starting staging, so there should not be a circular dependency where R4 cannot
catch up because it is “not yet in the configuration”.
There are also tests covering this flow:
- bootstrapping new peers succeeds once the new peer(s) are started
- if the new peer(s) are not started, reconfiguration times out as expected
So based on the current code, this looks more like either:
1. a bug in the specific bootstrap path, or
2. an environment/setup issue, for example:
   - R4 was not started with an empty group
   - R4 has an address or group-id mismatch
   - the leader cannot reach R4, or R4 cannot reply, append entries, or install
     a snapshot successfully
If you can share the complete logs from the leader and from R4 during the
failed setConfiguration attempt, we can help look further.
Best,
Xinyu Tan
On 2026/06/04 07:59:00 Tian Jiang wrote:
> Sorry for the interruption.
>
> I cannot add R4, because staging a configuration change requires ALL
> followers to catch up, as shown in the code below.
>
> I am not sure whether it is a designed feature or something else. But I
> expected the reconfiguration to succeed as long as there are more than half
> healthy replicas.
> Is it possible to soften the restriction on reconfiguration?
> Best,
> Tian Jiang
>
> ---- Replied Message ----
> | From | Tian Jiang<[email protected]> |
> | Date | 6/4/2026 15:53 |
> | To | dev<[email protected]> |
> | Subject | Unable to reconfigure when unavailable peer present? |
>
>
> Dear developers:
>
>
> I am testing a case when one of the replicas fails and I want to remove it.
>
>
> Assuming we have 3 replicas: R1-R3 and R1 fails.
> I plan to add R4 into this replica group then remove R1, but fail at the
> first step.
>
>
> I cannot add R4.
>
>