[
https://issues.apache.org/jira/browse/ARTEMIS-2716?focusedWorklogId=602870&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-602870
]
ASF GitHub Bot logged work on ARTEMIS-2716:
-------------------------------------------
Author: ASF GitHub Bot
Created on: 27/May/21 10:50
Start Date: 27/May/21 10:50
Worklog Time Spent: 10m
Work Description: franz1981 edited a comment on pull request #3555:
URL: https://github.com/apache/activemq-artemis/pull/3555#issuecomment-849440671
Update on
> why backup hasn't committed suicide given that it's not able to start as a live?
Assuming `Atomix` behaves correctly, this seems related to a bug on my side:
```java
private void startAsLive(final DistributedLock liveLock) throws Exception {
   // ...
   // IMPORTANT:
   // we're setting this activation JUST because it would allow the server to use its
   // getActivationChannelHandler to handle replication
   final ReplicationPrimaryActivation primaryActivation =
      new ReplicationPrimaryActivation(activeMQServer, distributedManager, policy.getLivePolicy());
   liveLock.addListener(primaryActivation);
   activeMQServer.setActivation(primaryActivation);
   activeMQServer.initialisePart2(false);
   final boolean stillLive;
   try {
      stillLive = liveLock.isHeldByCaller();
   } catch (UnavailableStateException e) {
      LOGGER.warn(e);
      throw new ActiveMQIllegalStateException("This server cannot check its role as a live: activation is failed");
   }
   if (!stillLive) {
      throw new ActiveMQIllegalStateException("This server is not live anymore: activation is failed");
   }
   // ...
```
If the quorum is lost before `liveLock.addListener(primaryActivation)`, the
current `AtomixDistributedLock::onStateChanged`:
```java
private void onStateChanged(PrimitiveState state) {
   LOGGER.info(state);
   switch (state) {
      case SUSPENDED:
      case EXPIRED:
      case CLOSED:
         for (LockListener listener : listeners) {
            listener.stateChanged(LockListener.EventType.UNAVAILABLE);
         }
         break;
   }
}
```
It's going to find `listeners` empty, so
`ReplicationPrimaryActivation::stateChanged` won't be called to asynchronously
stop the server.
The late check `liveLock.isHeldByCaller` should then fail and throw an exception,
but that exception won't cause the server to stop.
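The timing problem can be reproduced with a toy model. This is only a sketch: `LostNotificationDemo`, `ToyLock` and `Listener` are hypothetical stand-ins, not the real Artemis or Atomix types, and the model assumes that listeners are notified only at the moment the lock state changes.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicInteger;

// Toy model only: none of these types are the real Artemis or Atomix classes.
final class LostNotificationDemo {

   interface Listener {
      void unavailable();
   }

   /** Hypothetical stand-in for a distributed lock that notifies listeners on loss. */
   static final class ToyLock {
      private final List<Listener> listeners = new CopyOnWriteArrayList<>();
      private volatile boolean held = true;

      void addListener(Listener listener) {
         listeners.add(listener);
      }

      // Same shape as onStateChanged: only listeners registered at this moment are notified.
      void onLockLost() {
         held = false;
         for (Listener listener : listeners) {
            listener.unavailable();
         }
      }

      boolean isHeldByCaller() {
         return held;
      }
   }

   /** Lock is lost first, listener is added afterwards: the notification is missed. */
   static int notificationsWhenRegisteredLate() {
      ToyLock lock = new ToyLock();
      AtomicInteger notifications = new AtomicInteger();
      lock.onLockLost();
      lock.addListener(notifications::incrementAndGet);
      return notifications.get();
   }

   /** Listener is added first, lock is lost afterwards: the notification is delivered. */
   static int notificationsWhenRegisteredEarly() {
      ToyLock lock = new ToyLock();
      AtomicInteger notifications = new AtomicInteger();
      lock.addListener(notifications::incrementAndGet);
      lock.onLockLost();
      return notifications.get();
   }

   public static void main(String[] args) {
      System.out.println("late registration:  " + notificationsWhenRegisteredLate());
      System.out.println("early registration: " + notificationsWhenRegisteredEarly());
   }
}
```

In the late-registration case the only remaining signal is `isHeldByCaller()` returning `false`, which is why the late check has to be able to stop the server on its own.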
In short, the issue is that a failed `liveLock.isHeldByCaller` check must be
able to stop the server: on primary activation this seems to happen thanks to
`activeMQServer.callActivationFailureListeners(e)` from
https://issues.apache.org/jira/browse/ARTEMIS-388, which isn't used on backup
activation.
I'm investigating why `activeMQServer.callActivationFailureListeners(e);`
isn't used in any of the existing backup activations (shared nothing, shared
store...); if it turns out not to be a viable option, I'm going to call
`AtomixDistributedLock::onStateChanged` in case of a lost lock, before throwing
the exception, to asynchronously stop the server.
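The fallback idea (notify before throwing) could look roughly like the sketch below. It is a simplification under stated assumptions: `LateCheckFixDemo`, `ToyActivation` and `lateCheck` are hypothetical names, the activation is notified directly rather than through `AtomixDistributedLock::onStateChanged`, and a plain `IllegalStateException` stands in for `ActiveMQIllegalStateException`.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Toy model only: hypothetical names, not the real Artemis activation code.
final class LateCheckFixDemo {

   /** Hypothetical stand-in for the activation's "stop the server" reaction. */
   static final class ToyActivation {
      final AtomicBoolean stopRequested = new AtomicBoolean();

      void lockUnavailable() {
         // The real code would stop the server asynchronously; here we only record the request.
         stopRequested.set(true);
      }
   }

   /**
    * Late check in the style of startAsLive: if the lock is no longer held,
    * notify the activation explicitly before throwing, so that a missed
    * listener callback cannot leave the server running.
    */
   static void lateCheck(boolean stillLive, ToyActivation activation) {
      if (!stillLive) {
         activation.lockUnavailable();
         throw new IllegalStateException("This server is not live anymore: activation is failed");
      }
   }

   /** Runs the late check with a lost lock and reports whether a stop was requested. */
   static boolean stopRequestedAfterLostLock() {
      ToyActivation activation = new ToyActivation();
      try {
         lateCheck(false, activation);
      } catch (IllegalStateException expected) {
         // The exception alone would not stop the server; the explicit notification does.
      }
      return activation.stopRequested.get();
   }

   public static void main(String[] args) {
      System.out.println("stop requested: " + stopRequestedAfterLostLock());
   }
}
```

The point of ordering the notification before the `throw` is that the stop request no longer depends on whether the listener was registered in time.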
The same timing issue could happen with Zookeeper too, so it is worth
fixing.
As for the Atomix issue, I'm going to investigate a bit more what's going on:
it doesn't seem that `AtomixDistributedLock::onStateChanged` has been called
even though `liveLock.isHeldByCaller` has failed, which looks like a bug.
Issue Time Tracking
-------------------
Worklog Id: (was: 602870)
Time Spent: 6h (was: 5h 50m)
> Implements pluggable Quorum Vote
> --------------------------------
>
> Key: ARTEMIS-2716
> URL: https://issues.apache.org/jira/browse/ARTEMIS-2716
> Project: ActiveMQ Artemis
> Issue Type: New Feature
> Reporter: Francesco Nigro
> Assignee: Francesco Nigro
> Priority: Major
> Attachments: backup.png, primary.png
>
> Time Spent: 6h
> Remaining Estimate: 0h
>
> This task aims to deliver a new Quorum Vote mechanism for Artemis with the
> following objectives:
> # to make it pluggable
> # to cleanly separate the election phase and the cluster member states
> # to simplify most common setups in both amount of configuration and
> requirements (eg "witness" nodes could be implemented to support single
> master-slave pairs)
> Post-actions to help people adopt it, which need to be thought through upfront:
> # a clean upgrade path for current HA replication users
> # deprecate or integrate the current HA replication into the new version