[
https://issues.apache.org/jira/browse/ARTEMIS-2716?focusedWorklogId=602408&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-602408
]
ASF GitHub Bot logged work on ARTEMIS-2716:
-------------------------------------------
Author: ASF GitHub Bot
Created on: 26/May/21 16:11
Start Date: 26/May/21 16:11
Worklog Time Spent: 10m
Work Description: franz1981 commented on pull request #3555:
URL: https://github.com/apache/activemq-artemis/pull/3555#issuecomment-848905780
Some notes on the existing test failures:
-
`org.apache.activemq.artemis.tests.smoke.quorum.PluggableQuorumSinglePairTest#testBackupFailoverAndPrimaryFailback`
can fail when run with `forceKill=false` (that is, using a gentle SIGTERM
to stop the live primary). It's likely a bug/feature of `Atomix` while using
Raft: if the killed live was also the Raft leader, a soft kill lets it
release its live lock and then go down, forcing a re-election and causing the
failing-over backup to "reconsider" its ownership of the live lock it has just
obtained:
```
[main] 17:47:51,694 INFO
[org.apache.activemq.artemis.tests.smoke.quorum.PluggableQuorumSinglePairTest]
killing primary
**********************************
Killing server java.lang.UNIXProcess@674658f7
**********************************
atomixReplicationBackup-out:2021-05-26 17:47:51,706 WARN
[org.apache.activemq.artemis.core.client] AMQ212037: Connection failure to
localhost/127.0.0.1:61616 has been detected: AMQ219015: The connection was
disconnected because of server shutdown [code=DISCONNECTED]
atomixReplicationBackup-out:2021-05-26 17:47:51,706 WARN
[org.apache.activemq.artemis.core.client] AMQ212037: Connection failure to
localhost/127.0.0.1:61616 has been detected: AMQ219015: The connection was
disconnected because of server shutdown [code=DISCONNECTED]
atomixReplicationBackup-out:2021-05-26 17:47:51,765 INFO
[org.apache.activemq.artemis.quorum.atomix.AtomixDistributedLock] Failed to
acquire lock ac6f986d-be39-11eb-981a-8cc68169c75b
atomixReplicationBackup-out:2021-05-26 17:47:51,861 INFO
[org.apache.activemq.artemis.quorum.atomix.AtomixDistributedLock] Acquired lock
ac6f986d-be39-11eb-981a-8cc68169c75b with version Version{version=10}
atomixReplicationBackup-out:2021-05-26 17:47:51,863 INFO
[org.apache.activemq.artemis.core.server] AMQ221037:
ActiveMQServerImpl::serverUUID=ac6f986d-be39-11eb-981a-8cc68169c75b to become
'live'
atomixReplicationBackup-out:2021-05-26 17:47:51,872 WARN
[org.apache.activemq.artemis.core.client] AMQ212004: Failed to connect to
server.
atomixReplicationBackup-out:2021-05-26 17:47:51,974 INFO
[org.apache.activemq.artemis.core.server] AMQ221080: Deploying address
exampleTopic supporting [MULTICAST]
atomixReplicationBackup-out:2021-05-26 17:47:51,974 INFO
[org.apache.activemq.artemis.core.server] AMQ221080: Deploying address
exampleQueue supporting [ANYCAST]
atomixReplicationBackup-out:2021-05-26 17:47:51,975 INFO
[org.apache.activemq.artemis.core.server] AMQ221003: Deploying ANYCAST queue
exampleQueue on address exampleQueue
atomixReplicationBackup-out:2021-05-26 17:47:52,164 WARN
[org.apache.activemq.artemis.core.server.impl.ReplicationBackupActivation]
org.apache.activemq.artemis.quorum.UnavailableStateException:
io.atomix.primitive.PrimitiveException: java.net.ConnectException: Failed to
connect to the cluster
atomixReplicationBackup-out:2021-05-26 17:47:52,164 ERROR
[org.apache.activemq.artemis.core.server] AMQ224000: Failure in initialisation:
ActiveMQIllegalStateException[errorType=ILLEGAL_STATE message=This server
cannot check its role as a live: activation is failed]
atomixReplicationBackup-out: at
org.apache.activemq.artemis.core.server.impl.ReplicationBackupActivation.startAsLive(ReplicationBackupActivation.java:194)
[artemis-server-2.18.0-SNAPSHOT.jar:2.18.0-SNAPSHOT]
atomixReplicationBackup-out: at
org.apache.activemq.artemis.core.server.impl.ReplicationBackupActivation.run(ReplicationBackupActivation.java:157)
[artemis-server-2.18.0-SNAPSHOT.jar:2.18.0-SNAPSHOT]
atomixReplicationBackup-out: at
org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$ActivationThread.run(ActiveMQServerImpl.java:4240)
[artemis-server-2.18.0-SNAPSHOT.jar:2.18.0-SNAPSHOT]
atomixReplicationBackup-out:
atomixReplicationBackup-out:2021-05-26 17:47:55,018 WARN
[io.atomix.protocols.raft.roles.FollowerRole]
RaftServer{data-partition-1}{role=FOLLOWER} -
io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..)
failed: Connection refused: localhost/127.0.0.1:7777
atomixReplicationBackup-out:2021-05-26 17:47:55,019 INFO
[io.atomix.protocols.raft.impl.RaftContext] RaftServer{data-partition-1} -
Transitioning to CANDIDATE
atomixReplicationBackup-out:2021-05-26 17:47:55,021 INFO
[io.atomix.protocols.raft.roles.CandidateRole]
RaftServer{data-partition-1}{role=CANDIDATE} - Starting election
atomixReplicationBackup-out:2021-05-26 17:47:55,032 WARN
[io.atomix.protocols.raft.roles.CandidateRole]
RaftServer{data-partition-1}{role=CANDIDATE} -
io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..)
failed: Connection refused: localhost/127.0.0.1:7777
atomixReplicationBackup-out:2021-05-26 17:47:55,034 INFO
[io.atomix.protocols.raft.impl.RaftContext] RaftServer{data-partition-1} -
Transitioning to LEADER
atomixReplicationBackup-out:2021-05-26 17:47:55,041 INFO
[io.atomix.protocols.raft.impl.RaftContext] RaftServer{data-partition-1} -
Found leader backup
[raft-server-data-partition-1] 17:47:55,052 INFO
[io.atomix.protocols.raft.impl.RaftContext] RaftServer{data-partition-1} -
Found leader backup
[main] 17:47:55,283 INFO
[org.apache.activemq.artemis.tests.smoke.quorum.PluggableQuorumSinglePairTest]
killed primary
```
It shows that
```
atomixReplicationBackup-out:2021-05-26 17:47:51,765 INFO
[org.apache.activemq.artemis.quorum.atomix.AtomixDistributedLock] Failed to
acquire lock ac6f986d-be39-11eb-981a-8cc68169c75b
atomixReplicationBackup-out:2021-05-26 17:47:51,861 INFO
[org.apache.activemq.artemis.quorum.atomix.AtomixDistributedLock] Acquired lock
ac6f986d-be39-11eb-981a-8cc68169c75b with version Version{version=10}
```
the live lock is acquired by the failing-over backup, but
```
atomixReplicationBackup-out:2021-05-26 17:47:52,164 ERROR
[org.apache.activemq.artemis.core.server] AMQ224000: Failure in initialisation:
ActiveMQIllegalStateException[errorType=ILLEGAL_STATE message=This server
cannot check its role as a live: activation is failed]
```
the lock is then not considered available and the failover process is stopped.
The critical part seems to be:
```
atomixReplicationBackup-out:2021-05-26 17:47:55,034 INFO
[io.atomix.protocols.raft.impl.RaftContext] RaftServer{data-partition-1} -
Transitioning to LEADER
```
It shows that the backup has become the Raft leader (given that the live isn't
around, that's ok), but why hasn't the backup shut itself down, given that it's
not able to start as a live? This may need some attention.
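To make the expected behaviour concrete, here is a minimal sketch of an
activation step that treats an unverifiable live lock as fatal and stops the
broker instead of lingering. `LiveLock`, `BackupActivationSketch` and the
`stopBroker` callback are hypothetical names for illustration, not Artemis or
Atomix API, and stopping (rather than retrying after the re-election) is an
assumption about the desired policy:

```java
// Hypothetical stand-in for the distributed live lock, NOT Artemis/Atomix API.
interface LiveLock {
   // Returns true if the caller still owns the lock; throws if the quorum
   // cannot be reached (for example while a Raft re-election is in progress).
   boolean isHeldByCaller() throws Exception;
}

// Hypothetical activation step: become live only with confirmed ownership.
final class BackupActivationSketch {
   private final LiveLock liveLock;
   private final Runnable stopBroker;
   private volatile boolean live;

   BackupActivationSketch(LiveLock liveLock, Runnable stopBroker) {
      this.liveLock = liveLock;
      this.stopBroker = stopBroker;
   }

   // Returns true if the node became live, false if it stopped itself.
   boolean startAsLive() {
      boolean owned;
      try {
         owned = liveLock.isHeldByCaller();
      } catch (Exception quorumUnavailable) {
         // Ownership is unknown: treat it the same as "not owned".
         owned = false;
      }
      if (!owned) {
         // Do not stay around half-activated: stop the broker explicitly.
         stopBroker.run();
         return false;
      }
      live = true;
      return true;
   }

   boolean isLive() {
      return live;
   }
}
```

With this shape, the `UnavailableStateException` case in the log above would
end in an explicit stop rather than a failed activation that leaves the
process running.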
-
`org.apache.activemq.artemis.tests.smoke.quorum.ZookeeperPluggableQuorumSinglePairTest`
is hitting some thread leaks on `ListenerHandler` threads, e.g.:
```
Thread Thread[ListenerHandler-/127.0.0.1:38567,5,main] is still alive with
the following stackTrace:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:748)
...
```
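The stack above looks like an idle `ThreadPoolExecutor` worker parked while
waiting for tasks, which is what a pool that was never shut down leaves
behind. As a rough sketch of how such leftovers can be listed by name prefix
(`ThreadLeakCheck` and its method are hypothetical helpers, not part of the
Artemis test suite):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the Artemis test suite.
final class ThreadLeakCheck {
   // Returns the names of all live threads whose name starts with the prefix.
   static List<String> aliveThreadsWithPrefix(String prefix) {
      List<String> names = new ArrayList<>();
      // getAllStackTraces() returns a snapshot keyed by the live threads.
      for (Thread thread : Thread.getAllStackTraces().keySet()) {
         if (thread.isAlive() && thread.getName().startsWith(prefix)) {
            names.add(thread.getName());
         }
      }
      return names;
   }
}
```

Calling `ThreadLeakCheck.aliveThreadsWithPrefix("ListenerHandler")` once the
servers are stopped should return an empty list if the owning executor was
shut down and its workers have exited.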
Issue Time Tracking
-------------------
Worklog Id: (was: 602408)
Time Spent: 5h (was: 4h 50m)
> Implements pluggable Quorum Vote
> --------------------------------
>
> Key: ARTEMIS-2716
> URL: https://issues.apache.org/jira/browse/ARTEMIS-2716
> Project: ActiveMQ Artemis
> Issue Type: New Feature
> Reporter: Francesco Nigro
> Assignee: Francesco Nigro
> Priority: Major
> Attachments: backup.png, primary.png
>
> Time Spent: 5h
> Remaining Estimate: 0h
>
> This task aims to deliver a new Quorum Vote mechanism for Artemis with the
> following objectives:
> # to make it pluggable
> # to cleanly separate the election phase and the cluster member states
> # to simplify the most common setups, in terms of both the amount of
> configuration and the requirements (e.g. "witness" nodes could be implemented
> to support single master-slave pairs)
> Post-actions to help people adopt it, which need to be thought through upfront:
> # a clean upgrade path for current HA replication users
> # deprecate or integrate the current HA replication into the new version