[
https://issues.apache.org/jira/browse/ARTEMIS-2808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142332#comment-17142332
]
Justin Bertram commented on ARTEMIS-2808:
-----------------------------------------
It's imperative to understand that the storage is an _absolutely essential_
component of the broker's infrastructure. Without access to storage the broker
cannot persist messages when clients send them or remove messages when clients
acknowledge them. Without the storage the broker is basically useless.
Therefore, most storage errors are considered _critical_ and will cause the
broker to shut itself down.

In a shared-store configuration the storage represents a single point of
failure. If the shared storage fails then neither the master nor the slave can
work properly. Therefore it's even more important to ensure that the shared
storage is robust and reliable.

I recently sent a fix for ARTEMIS-2807 which may be something you need. That
issue was observed in an environment where NFS had failed and the master was
not able to shut itself down completely, which left it in a hung state. The
broker process eventually had to be killed, just like in your scenario. To
confirm that you're hitting the same issue I'd need to see thread dumps from
the master broker taken after you kill the NFS server.
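
For readers unsure what a thread dump contains: for a broker running as a
separate process, the usual way to get one is the JDK's jstack tool run against
the broker's PID. The sketch below is only an illustration of the same kind of
information gathered in-process through the standard ThreadMXBean API. The
class and method names (ThreadDumpSketch, dumpAllThreads) are hypothetical and
are not part of Artemis.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper, not part of Artemis: renders the state and stack of
// every live thread in the current JVM as plain text.
public class ThreadDumpSketch {

    static String dumpAllThreads() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        // Only ask for lock details if this JVM can report them.
        ThreadInfo[] infos = bean.dumpAllThreads(
                bean.isObjectMonitorUsageSupported(),
                bean.isSynchronizerUsageSupported());

        StringBuilder out = new StringBuilder();
        for (ThreadInfo info : infos) {
            out.append('"').append(info.getThreadName()).append("\" ")
               .append(info.getThreadState()).append('\n');
            // A thread stuck on storage I/O would show the blocking call
            // near the top of its stack.
            for (StackTraceElement frame : info.getStackTrace()) {
                out.append("    at ").append(frame).append('\n');
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(dumpAllThreads());
    }
}
```

Taking two or three dumps a few seconds apart makes it easier to tell a thread
that is genuinely stuck from one that merely happened to be busy.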

Once the NFS server is back the slave broker should, in theory, see that the
master is no longer holding its lock on the journal, which means the slave can
activate and take over. It's not clear why that isn't happening in your
environment. I'd need to see thread dumps from the slave taken after you kill
the NFS server as well as after the NFS server has been restarted. It would
also be good to know which NFS mount options are configured on both the master
and the slave.
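
The lock hand-off described above can be pictured with a minimal sketch. It
assumes the journal lock behaves like an ordinary OS-level file lock as exposed
by java.nio; the class and method names (SharedStoreLockSketch, HeldLock,
tryAcquire, waitForLock) and the temporary lock file are hypothetical and are
not Artemis's actual implementation.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical illustration of a shared-store lock hand-off, not Artemis code.
public class SharedStoreLockSketch {

    /** An acquired lock together with the channel that backs it. */
    static final class HeldLock implements AutoCloseable {
        private final FileChannel channel;
        private final FileLock lock;

        HeldLock(FileChannel channel, FileLock lock) {
            this.channel = channel;
            this.lock = lock;
        }

        @Override
        public void close() throws IOException {
            try {
                lock.release();
            } finally {
                channel.close();
            }
        }
    }

    /** Tries once; returns null if the lock is currently held elsewhere. */
    static HeldLock tryAcquire(Path lockFile) throws IOException {
        FileChannel channel = FileChannel.open(
                lockFile, StandardOpenOption.CREATE, StandardOpenOption.WRITE);
        try {
            FileLock lock = channel.tryLock();
            if (lock != null) {
                return new HeldLock(channel, lock);
            }
        } catch (OverlappingFileLockException e) {
            // The lock is already held inside this same JVM.
        } catch (IOException e) {
            channel.close();
            throw e;
        }
        channel.close();
        return null;
    }

    /** Polls like a waiting backup; returns null if every attempt fails. */
    static HeldLock waitForLock(Path lockFile, int attempts, long pauseMillis)
            throws IOException, InterruptedException {
        for (int i = 0; i < attempts; i++) {
            HeldLock held = tryAcquire(lockFile);
            if (held != null) {
                return held;
            }
            Thread.sleep(pauseMillis);
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        Path lockFile = Files.createTempFile("server", ".lock");
        try (HeldLock master = tryAcquire(lockFile)) {
            System.out.println("master holds lock: " + (master != null));
            System.out.println("backup can activate: "
                    + (tryAcquire(lockFile) != null));
        }
        try (HeldLock backup = waitForLock(lockFile, 5, 100L)) {
            System.out.println("backup activated after release: "
                    + (backup != null));
        }
        Files.deleteIfExists(lockFile);
    }
}
```

This demo runs in a single process on a local file. On NFS, whether a lock is
seen as released after a server restart depends on the NFS version, server and
mount options, which is why those details matter for diagnosing this issue.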

You say that after killing the master broker you tried to restart it but "The
master did not start up at all." However, you provide no logging or other
details to help diagnose the issue, so I can't really comment on that point.
I'd need to see logging from when you restarted the master to investigate
further.

> Artemis HA with shared storage strategy does not reconnect with shared
> storage if reconnection happens at shared storage
> ------------------------------------------------------------------------------------------------------------------------
>
> Key: ARTEMIS-2808
> URL: https://issues.apache.org/jira/browse/ARTEMIS-2808
> Project: ActiveMQ Artemis
> Issue Type: Bug
> Affects Versions: 2.11.0
> Environment: Windows 10
> Reporter: Karan Aggarwal
> Priority: Blocker
>
> We verified the behavior of Artemis HA by bringing down the shared storage
> (VM) while run is in progress and here is the observation:
> *Scenario:*
> * When Artemis services are up and running and run is in progress we
> restarted the machine hosting the shared storage
> * Shared storage was back up in 5 mins
> * Both Artemis master and slave did not connect back to the shared storage
> * We tried stopping the Artemis brokers. The slave stopped, but the master
> did not stop. We had to kill the process.
> * We tried to start the Artemis brokers. The master did not start up at all.
> The slave started successfully.
> * We restarted the master Artemis server. Server started successfully and
> acquired back up.
> Shared Storage type: NFS
> Impact: The run is stopped and Artemis servers need to be started again
> every time shared storage connection goes down momentarily.