[ 
https://issues.apache.org/jira/browse/ARTEMIS-2808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142332#comment-17142332
 ] 

Justin Bertram commented on ARTEMIS-2808:
-----------------------------------------

It's imperative to understand that the storage is an _absolutely essential_ 
component of the broker's infrastructure. Without access to storage the broker 
cannot persist messages when clients send them or remove messages when clients 
acknowledge them. Without the storage the broker is basically useless. 
Therefore, most storage errors are considered _critical_ and will cause the 
broker to shut itself down.

In a shared-store configuration the storage represents a single point of 
failure. If the shared storage fails then neither the master nor the slave can 
work properly. Therefore, it's even more important to ensure that the shared 
storage is robust and reliable.

I recently sent a fix for ARTEMIS-2807 which may be something you need. That 
issue was observed in an environment where NFS had failed and the master was 
not able to shut itself down all the way, resulting in a hung state. The broker 
process eventually had to be killed, just like in your scenario. I'd need to see 
thread dumps from the master broker after you kill the NFS server to confirm.

Once the NFS server is back the slave broker should, in theory, see that the 
master is no longer holding its lock on the journal which means the slave can 
activate and take over. It's not clear why that isn't happening in your 
environment. I'd need to see thread dumps from the slave after you kill the NFS 
server as well as after the NFS server has been restarted. It would also help 
to know the NFS mount options configured on both the master and the slave.
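
To illustrate the idea of a backup checking whether a lock on shared storage 
is still held, here is a minimal Java sketch using the standard 
java.nio file-locking API. This is not the broker's actual implementation; the 
class name LockProbe, the method lockIsFree, and the default file name 
"server.lock" are hypothetical and chosen only for illustration. How such locks 
behave over NFS depends on the client and its mount options, which is one 
reason those options matter here.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper: probes whether an exclusive lock on a file is available.
class LockProbe {

    // Returns true if an exclusive lock could be acquired (and was then
    // released again), false if the lock is currently held elsewhere.
    static boolean lockIsFree(Path lockFile) throws IOException {
        try (FileChannel channel = FileChannel.open(lockFile,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            FileLock lock;
            try {
                // Non-blocking attempt; returns null if another process holds the lock.
                lock = channel.tryLock();
            } catch (OverlappingFileLockException e) {
                // The lock is already held inside this same JVM.
                return false;
            }
            if (lock == null) {
                return false;
            }
            lock.release();
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        // "server.lock" is an assumed placeholder; pass the real path as an argument.
        Path lockFile = Path.of(args.length > 0 ? args[0] : "server.lock");
        System.out.println(lockIsFree(lockFile) ? "lock is free" : "lock is held");
    }
}
```

Running a probe like this against a file on the shared mount from the slave's 
host, before and after the NFS server restart, is one rough way to see whether 
a stale lock might be involved. It does not replace the thread dumps.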

You say that after killing the master broker you tried to restart it but, "The 
master did not start up at all." However, you provide no logging or other 
details to help diagnose the issue so I can't really comment on that point. I'd 
need to see logging from when you restarted the master to investigate further.

> Artemis HA with shared storage strategy does not reconnect with shared 
> storage if reconnection happens at shared storage
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: ARTEMIS-2808
>                 URL: https://issues.apache.org/jira/browse/ARTEMIS-2808
>             Project: ActiveMQ Artemis
>          Issue Type: Bug
>    Affects Versions: 2.11.0
>         Environment: Windows 10
>            Reporter: Karan Aggarwal
>            Priority: Blocker
>
> We verified the behavior of Artemis HA by bringing down the shared storage 
> (VM) while run is in progress and here is the observation: 
> *Scenario:*
>  * When Artemis services are up and running and run is in progress we 
> restarted the machine hosting the shared storage
>  * Shared storage was back up in 5 mins
>  * Both Artemis master and slave did not connect back to the shared storage
>  * We tried stopping the Artemis brokers. The slave stopped, but the master 
> did not stop. We had to kill the process.
>  * We tried to start the Artemis brokers. The master did not start up at all. 
> The slave started successfully.
>  * We restarted the master Artemis server. Server started successfully and 
> acquired back up.
> Shared Storage type: NFS
> Impact: The run is stopped and Artemis servers need to be started again 
> every time shared storage connection goes down momentarily.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
