[
https://issues.apache.org/jira/browse/ARTEMIS-2808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17175032#comment-17175032
]
Justin Bertram commented on ARTEMIS-2808:
-----------------------------------------
Thanks for the logs and thread dumps. From what I can tell the broker is
behaving reasonably, although the outcome clearly isn't what's expected or
desired.
In scenario #1 the backup broker is happily waiting to get the lock on the
journal. This thread is in the thread dumps before and after NFS goes down:
{noformat}
"AMQ229000: Activation for server ActiveMQServerImpl::serverUUID=5451042e-b0c6-11ea-80b3-005056979868" #32 prio=5 os_prio=0 tid=0x000000001a439800 nid=0x9ac waiting on condition [0x000000001ca1f000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
    at java.lang.Thread.sleep(Native Method)
    at org.apache.activemq.artemis.core.server.impl.FileLockNodeManager.lock(FileLockNodeManager.java:403)
    at org.apache.activemq.artemis.core.server.impl.FileLockNodeManager.awaitLiveNode(FileLockNodeManager.java:188)
    at org.apache.activemq.artemis.core.server.impl.SharedStoreBackupActivation.run(SharedStoreBackupActivation.java:77)
    at org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$ActivationThread.run(ActiveMQServerImpl.java:3907)
{noformat}
And the log contains entries like this once NFS goes down:
{noformat}
2020-06-23 08:29:02,362 DEBUG [org.apache.activemq.artemis.core.server.impl.FileLockNodeManager] Failure when accessing a lock file: java.io.IOException: An unexpected network error occurred
    at sun.nio.ch.FileDispatcherImpl.lock0(Native Method) [rt.jar:1.8.0_251]
    at sun.nio.ch.FileDispatcherImpl.lock(Unknown Source) [rt.jar:1.8.0_251]
    at sun.nio.ch.FileChannelImpl.tryLock(Unknown Source) [rt.jar:1.8.0_251]
    at java.nio.channels.FileChannel.tryLock(Unknown Source) [rt.jar:1.8.0_251]
    at org.apache.activemq.artemis.core.server.impl.FileLockNodeManager.tryLock(FileLockNodeManager.java:351) [artemis-server-2.13.0.jar:2.13.0]
    at org.apache.activemq.artemis.core.server.impl.FileLockNodeManager.lock(FileLockNodeManager.java:370) [artemis-server-2.13.0.jar:2.13.0]
    at org.apache.activemq.artemis.core.server.impl.FileLockNodeManager.awaitLiveNode(FileLockNodeManager.java:188) [artemis-server-2.13.0.jar:2.13.0]
    at org.apache.activemq.artemis.core.server.impl.SharedStoreBackupActivation.run(SharedStoreBackupActivation.java:77) [artemis-server-2.13.0.jar:2.13.0]
    at org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$ActivationThread.run(ActiveMQServerImpl.java:3907) [artemis-server-2.13.0.jar:2.13.0]
{noformat}
Unfortunately, it appears that even after NFS is restored the broker still
encounters those exceptions, which indicates that something is going wrong at
the JVM/OS level. It looks like the file handle for the lock is stale once NFS
is restored, so it may be necessary for the broker to re-create the file
handle from scratch.
That same basic thing appears to be happening in scenario #2 as well.
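To illustrate what re-creating the file handle could look like, here is a
minimal sketch. It is not the actual FileLockNodeManager code: the class
ReopeningFileLocker, its methods, and the retryMillis/maxAttempts parameters
are hypothetical names used only for illustration. The idea is to poll for the
lock as the activation thread does, but to close and reopen the FileChannel
whenever tryLock() fails with an IOException, instead of retrying on a handle
that may be stale.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper (not part of Artemis): polls for an exclusive file lock
// and reopens the channel after any I/O failure.
final class ReopeningFileLocker {
    private final Path lockFile;
    private FileChannel channel;

    ReopeningFileLocker(Path lockFile) {
        this.lockFile = lockFile;
    }

    // Opens a fresh read/write channel on the lock file, creating it if needed.
    private FileChannel open() throws IOException {
        return FileChannel.open(lockFile,
                StandardOpenOption.CREATE,
                StandardOpenOption.READ,
                StandardOpenOption.WRITE);
    }

    // Returns the lock, or null if it was not acquired within maxAttempts.
    FileLock lock(long retryMillis, int maxAttempts) throws InterruptedException {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                if (channel == null || !channel.isOpen()) {
                    channel = open();
                }
                FileLock lock = channel.tryLock();
                if (lock != null) {
                    return lock;
                }
                // null means another process holds the lock; keep the channel.
            } catch (IOException e) {
                // The handle may be stale (e.g. after an NFS outage):
                // discard it so the next attempt starts from scratch.
                close();
            }
            Thread.sleep(retryMillis);
        }
        return null;
    }

    // Closes the current channel, ignoring errors from an already-broken handle.
    void close() {
        if (channel != null) {
            try {
                channel.close();
            } catch (IOException ignored) {
                // Nothing useful to do with a failure while discarding the handle.
            }
            channel = null;
        }
    }
}
```

Whether reopening the channel is enough to recover on a given NFS client and OS
is an assumption that would need to be verified against the scenarios above.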
> Artemis HA with shared storage strategy does not reconnect with shared
> storage if reconnection happens at shared storage
> ------------------------------------------------------------------------------------------------------------------------
>
> Key: ARTEMIS-2808
> URL: https://issues.apache.org/jira/browse/ARTEMIS-2808
> Project: ActiveMQ Artemis
> Issue Type: Bug
> Affects Versions: 2.11.0
> Environment: Windows 10
> Reporter: Karan Aggarwal
> Priority: Blocker
> Attachments: Scenario_1.zip, Scenario_2.zip
>
>
> We verified the behavior of Artemis HA by bringing down the shared storage
> (VM) while run is in progress and here is the observation:
> *Scenario:*
> * When Artemis services are up and running and run is in progress we
> restarted the machine hosting the shared storage
> * Shared storage was back up in 5 mins
> * Both Artemis master and slave did not connect back to the shared storage
> * We tried stopping the Artemis brokers. The slave stopped, but the master
> did not stop. We had to kill the process.
> * We tried to start the Artemis brokers. The master did not start up at all.
> The slave started successfully.
> * We restarted the master Artemis server. Server started successfully and
> acquired back up.
> Shared Storage type: NFS
> Impact: The run is stopped and the Artemis servers need to be started again
> every time the shared storage connection goes down momentarily.