[jira] [Commented] (ARTEMIS-2808) Artemis HA with shared storage strategy does not reconnect with shared storage if reconnection happens at shared storage

Justin Bertram (Jira) Mon, 10 Aug 2020 13:02:12 -0700


    [ https://issues.apache.org/jira/browse/ARTEMIS-2808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17175032#comment-17175032 ]


Justin Bertram commented on ARTEMIS-2808:
-----------------------------------------

Thanks for the logs and thread dumps. From what I can tell the broker is 
behaving reasonably, although the outcome clearly isn't what's expected or 
desired.

In scenario #1 the backup broker is happily waiting to get the lock on the 
journal. This thread is in the thread dumps before and after NFS goes down:
{noformat}
"AMQ229000: Activation for server ActiveMQServerImpl::serverUUID=5451042e-b0c6-11ea-80b3-005056979868" #32 prio=5 os_prio=0 tid=0x000000001a439800 nid=0x9ac waiting on condition [0x000000001ca1f000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at org.apache.activemq.artemis.core.server.impl.FileLockNodeManager.lock(FileLockNodeManager.java:403)
        at org.apache.activemq.artemis.core.server.impl.FileLockNodeManager.awaitLiveNode(FileLockNodeManager.java:188)
        at org.apache.activemq.artemis.core.server.impl.SharedStoreBackupActivation.run(SharedStoreBackupActivation.java:77)
        at org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$ActivationThread.run(ActiveMQServerImpl.java:3907)
{noformat}
And the log contains entries like this once NFS goes down:
{noformat}
2020-06-23 08:29:02,362 DEBUG [org.apache.activemq.artemis.core.server.impl.FileLockNodeManager] Failure when accessing a lock file: java.io.IOException: An unexpected network error occurred
        at sun.nio.ch.FileDispatcherImpl.lock0(Native Method) [rt.jar:1.8.0_251]
        at sun.nio.ch.FileDispatcherImpl.lock(Unknown Source) [rt.jar:1.8.0_251]
        at sun.nio.ch.FileChannelImpl.tryLock(Unknown Source) [rt.jar:1.8.0_251]
        at java.nio.channels.FileChannel.tryLock(Unknown Source) [rt.jar:1.8.0_251]
        at org.apache.activemq.artemis.core.server.impl.FileLockNodeManager.tryLock(FileLockNodeManager.java:351) [artemis-server-2.13.0.jar:2.13.0]
        at org.apache.activemq.artemis.core.server.impl.FileLockNodeManager.lock(FileLockNodeManager.java:370) [artemis-server-2.13.0.jar:2.13.0]
        at org.apache.activemq.artemis.core.server.impl.FileLockNodeManager.awaitLiveNode(FileLockNodeManager.java:188) [artemis-server-2.13.0.jar:2.13.0]
        at org.apache.activemq.artemis.core.server.impl.SharedStoreBackupActivation.run(SharedStoreBackupActivation.java:77) [artemis-server-2.13.0.jar:2.13.0]
        at org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$ActivationThread.run(ActiveMQServerImpl.java:3907) [artemis-server-2.13.0.jar:2.13.0]
{noformat}
Unfortunately, it appears that even after NFS is restored the broker still 
encounters those exceptions, which indicates that something is going wrong at 
the JVM/OS level. It looks like the file handle for the lock is stale once NFS 
is restored, so it may be necessary for the broker to re-create the file handle 
from scratch.
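
To illustrate what re-creating the file handle could look like, here is a 
minimal sketch. It is not the broker's code: ReopeningFileLocker and its 
methods are hypothetical names made up for this example, and it only uses the 
standard java.nio API. The assumption, not verified against NFS on Windows, is 
that a freshly opened FileChannel would not carry the stale handle.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper for illustration only; not part of Artemis.
final class ReopeningFileLocker {
    private final Path lockFile;
    private FileChannel channel; // discarded and re-opened after any I/O failure

    ReopeningFileLocker(Path lockFile) {
        this.lockFile = lockFile;
    }

    // Polls for an exclusive lock on the whole file.
    // Returns null if the lock was not acquired within maxAttempts.
    FileLock lock(int maxAttempts, long sleepMillis) throws InterruptedException {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                if (channel == null || !channel.isOpen()) {
                    // Open a fresh handle instead of reusing a possibly stale one.
                    channel = FileChannel.open(lockFile,
                            StandardOpenOption.CREATE,
                            StandardOpenOption.READ,
                            StandardOpenOption.WRITE);
                }
                FileLock lock = channel.tryLock();
                if (lock != null) {
                    return lock;
                }
            } catch (IOException e) {
                // Assume the handle may be stale: drop it so the next
                // attempt opens a new channel from scratch.
                close();
            }
            Thread.sleep(sleepMillis);
        }
        return null;
    }

    // Closes the current channel, which also releases any lock held through it.
    void close() {
        if (channel != null) {
            try {
                channel.close();
            } catch (IOException ignored) {
                // Nothing useful to do if closing a broken handle fails.
            }
            channel = null;
        }
    }
}
```

The key difference from retrying tryLock() on the same channel is the close() 
in the catch block: every IOException causes the next attempt to open a new 
channel rather than reuse the old one.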

The same basic problem appears to be happening in scenario #2 as well.


> Artemis HA with shared storage strategy does not reconnect with shared 
> storage if reconnection happens at shared storage
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: ARTEMIS-2808
>                 URL: https://issues.apache.org/jira/browse/ARTEMIS-2808
>             Project: ActiveMQ Artemis
>          Issue Type: Bug
>    Affects Versions: 2.11.0
>         Environment: Windows 10
>            Reporter: Karan Aggarwal
>            Priority: Blocker
>         Attachments: Scenario_1.zip, Scenario_2.zip
>
>
> We verified the behavior of Artemis HA by bringing down the shared storage 
> (VM) while run is in progress and here is the observation: 
> *Scenario:*
>  * When Artemis services are up and running and run is in progress we 
> restarted the machine hosting the shared storage
>  * Shared storage was back up in 5 mins
>  * Both Artemis master and slave did not connect back to the shared storage
>  * We tried stopping the Artemis brokers. The slave stopped, but the master 
> did not stop. We had to kill the process.
>  * We tried to start the Artemis brokers. The master did not start up at all. 
> The slave started successfully.
>  * We restarted the master Artemis server. Server started successfully and 
> acquired back up.
> Shared Storage type: NFS
> Impact: The run is stopped and the Artemis servers need to be started again 
> every time the shared storage connection goes down momentarily.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
