Francesco Nigro created ARTEMIS-3430:
----------------------------------------

             Summary: Activation Sequence Auto-Repair
                 Key: ARTEMIS-3430
                 URL: https://issues.apache.org/jira/browse/ARTEMIS-3430
             Project: ActiveMQ Artemis
          Issue Type: Bug
            Reporter: Francesco Nigro
            Assignee: Francesco Nigro


This can be seen either as a bug or as an improvement over the existing 
self-heal behaviour of the activation sequence introduced by 
https://issues.apache.org/jira/browse/ARTEMIS-3340.

In short, the existing protocol to increase the activation sequence while 
un-replicated is:
# remote i -> -(i + 1), i.e. remote CLAIM 
# local i -> (i + 1), i.e. local commit
# remote -(i + 1) -> (i + 1), i.e. remote COMMIT
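
The three steps above can be sketched as a small model. This is only an 
illustration, not the Artemis implementation: the names Sequences, claim, 
local_commit and remote_commit are hypothetical, and the coordinated (remote) 
sequence is modelled as a plain integer where a negative value marks a pending 
claim.

```python
from dataclasses import dataclass


@dataclass
class Sequences:
    remote: int  # coordinated sequence; negative means "claimed, not committed"
    local: int   # sequence stored alongside the broker's own journal


def claim(s: Sequences) -> None:
    # Step 1: remote i -> -(i + 1)
    if s.remote < 0:
        raise ValueError("a claim is already pending")
    s.remote = -(s.remote + 1)


def local_commit(s: Sequences) -> None:
    # Step 2: local i -> (i + 1)
    s.local += 1


def remote_commit(s: Sequences) -> None:
    # Step 3: remote -(i + 1) -> (i + 1)
    if s.remote >= 0:
        raise ValueError("no pending claim to commit")
    s.remote = -s.remote


s = Sequences(remote=3, local=3)
claim(s)          # remote == -4, local == 3
local_commit(s)   # remote == -4, local == 4
remote_commit(s)  # remote == 4,  local == 4
```

A failure between any two of these calls leaves the remote value negative, 
which is the state the rest of this issue is concerned with.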

This protocol is designed to let witness brokers recognize when their data is 
no longer up to date, and to spare them from throwing it away if it still has 
some value (because of a failure to commit the sequence).

In the current version, self-repair is allowed only if the live broker has 
performed 2. but not 3., i.e. the local activation sequence is updated, but 
the coordinated one isn't committed.
If the failing broker is restarted, it can "fix" the coordinated sequence and 
move on to become live again, but if 2. fails (or just never happens), the 
coordinated activation sequence can be fixed only with some admin 
intervention, after inspecting the local activation sequences.
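
As a rough sketch of the rule described above (the function name and the 
integer encoding are assumptions made for illustration, not the Artemis API), 
a restarting broker can self-repair today only when its local sequence already 
equals the claimed value:

```python
def can_self_repair(coordinated: int, local: int) -> bool:
    # A negative coordinated value marks a claimed, not yet committed, sequence.
    if coordinated >= 0:
        return False  # nothing is pending, so there is nothing to repair
    claimed = -coordinated
    # Step 2 was performed (local == claimed) but step 3 was not.
    return local == claimed


print(can_self_repair(-4, 4))  # True: step 2 done, step 3 missing
print(can_self_repair(-4, 3))  # False: step 2 never happened
```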

Other brokers cannot "fix" the sequence because the local sequence of the 
failed broker is unknown, and just rolling back the claimed one could make the 
failed broker believe it has up-to-date data too, causing journal 
misalignments.

A possible solution is to repair the claimed sequence and then, *after* the 
repair, increase it further, so that no broker can run un-replicated with the 
repaired value: this would forcibly age brokers with "goodish" data, but would 
allow other brokers to auto-repair without admin intervention.
The sole drawback of this strategy is that a further failure of the repairing 
broker, while it is further increasing the sequence, gives that broker full 
and exclusive responsibility for the auto-repair, because no other broker can 
have a high-enough local sequence.
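
A minimal sketch of the proposed repair, under stated assumptions: 
repair_then_age and can_run_unreplicated are hypothetical names, only the 
coordinated value is tracked (the repairing broker's own local commit is left 
out), and "may run un-replicated" is simplified to "local sequence equals a 
committed coordinated sequence".

```python
def repair_then_age(coordinated: int) -> list:
    """Coordinated values visited while repairing a pending claim."""
    if coordinated >= 0:
        raise ValueError("no pending claim to repair")
    claimed = -coordinated
    return [
        claimed,         # repair: commit the pending claim -(i + 1) -> (i + 1)
        -(claimed + 1),  # claim the next sequence
        claimed + 1,     # commit it, so nobody can run with the repaired value
    ]


def can_run_unreplicated(local: int, coordinated: int) -> bool:
    # Simplifying assumption: local must match a committed coordinated value.
    return coordinated >= 0 and local == coordinated


i = 3
final = repair_then_age(-(i + 1))[-1]
print(final)                               # 5
print(can_run_unreplicated(i, final))      # False: failed broker never did step 2
print(can_run_unreplicated(i + 1, final))  # False: failed broker did step 2
```

Whichever local sequence the failed broker ended up with, i or i + 1, it is 
behind the final coordinated value and therefore aged out.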



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
