[ 
https://issues.apache.org/jira/browse/ARTEMIS-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17418751#comment-17418751
 ] 

Justin Bertram edited comment on ARTEMIS-3030 at 12/2/24 9:42 PM:
------------------------------------------------------------------

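We have been hitting the same issue in QA and production. In our simulated 
tests (temporarily dropping the NIC on one of the servers), the NFS mount 
options suggested in an earlier comment 
(https://issues.apache.org/jira/browse/ARTEMIS-3030?focusedCommentId=17251906&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17251906)
resolved the issue.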

Previous mount options, with which we were able to reproduce the issue:
{noformat}
(rw,relatime,vers=4.1,rsize=131072,wsize=131072,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.0.8.170,local_lock=none,addr=10.0.9.8)
{noformat}
New mount options, with which at least our simulated test worked as expected:
{noformat}
(rw,relatime,sync,vers=4.1,rsize=131072,wsize=131072,namlen=255,acregmin=0,acregmax=0,acdirmin=0,acdirmax=0,soft,noac,proto=tcp,timeo=50,retrans=1,sec=sys,clientaddr=10.0.8.170,local_lock=none,addr=10.0.9.8)
{noformat}
We will apply these changes to all environments, and I will post an update 
if we hit the issue again afterwards.
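
The two option strings are long enough that the actual differences are easy to miss. As a rough aid, here is a minimal sketch that parses both strings and lists what changed. The class and method names (MountOptionDiff, parse, diff) are made up for illustration and are not part of Artemis or any NFS tooling. Note that, per nfs(5), timeo is expressed in tenths of a second.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for comparing two NFS mount option strings.
class MountOptionDiff {

    /** Parses "(a,b=1,...)" into a map; flags without a value map to "". */
    static Map<String, String> parse(String options) {
        String body = options.trim();
        if (body.startsWith("(") && body.endsWith(")")) {
            body = body.substring(1, body.length() - 1);
        }
        Map<String, String> parsed = new LinkedHashMap<>();
        for (String token : body.split(",")) {
            if (token.isEmpty()) {
                continue;
            }
            int eq = token.indexOf('=');
            if (eq < 0) {
                parsed.put(token, "");
            } else {
                parsed.put(token.substring(0, eq), token.substring(eq + 1));
            }
        }
        return parsed;
    }

    /** Lists every option that was removed, added, or changed value, sorted by name. */
    static Map<String, String> diff(Map<String, String> before, Map<String, String> after) {
        Map<String, String> changes = new TreeMap<>();
        for (Map.Entry<String, String> e : before.entrySet()) {
            String key = e.getKey();
            if (!after.containsKey(key)) {
                changes.put(key, "removed");
            } else if (!after.get(key).equals(e.getValue())) {
                changes.put(key, e.getValue() + " -> " + after.get(key));
            }
        }
        for (Map.Entry<String, String> e : after.entrySet()) {
            if (!before.containsKey(e.getKey())) {
                String value = e.getValue();
                changes.put(e.getKey(), value.isEmpty() ? "added" : "added (" + value + ")");
            }
        }
        return changes;
    }

    public static void main(String[] args) {
        // The two option strings quoted in this comment.
        String before = "(rw,relatime,vers=4.1,rsize=131072,wsize=131072,namlen=255,hard,"
                + "proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.0.8.170,"
                + "local_lock=none,addr=10.0.9.8)";
        String after = "(rw,relatime,sync,vers=4.1,rsize=131072,wsize=131072,namlen=255,"
                + "acregmin=0,acregmax=0,acdirmin=0,acdirmax=0,soft,noac,proto=tcp,"
                + "timeo=50,retrans=1,sec=sys,clientaddr=10.0.8.170,"
                + "local_lock=none,addr=10.0.9.8)";
        diff(parse(before), parse(after))
                .forEach((name, change) -> System.out.println(name + ": " + change));
    }
}
```

For the strings above, the differences come down to: hard replaced by soft, sync and noac added, the four attribute-cache timers (acregmin, acregmax, acdirmin, acdirmax) set to 0, timeo lowered from 600 to 50, and retrans lowered from 2 to 1.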


was (Author: sebaker):
We have been hitting the same issue on QA and production. In our simulated 
tests (temporarily dropping the nic on one of the servers) the nfs mount 
options suggested in 
https://issues.apache.org/jira/browse/ARTEMIS-3030?focusedCommentId=17251906&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17251906
 resolved the issue.

 

Prior mount options - where we were able to reproduce:

(rw,relatime,vers=4.1,rsize=131072,wsize=131072,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.0.8.170,local_lock=none,addr=10.0.9.8)

 

After mount options where at least our simulated test worked as expected:

(rw,relatime,sync,vers=4.1,rsize=131072,wsize=131072,namlen=255,acregmin=0,acregmax=0,acdirmin=0,acdirmax=0,soft,noac,proto=tcp,timeo=50,retrans=1,sec=sys,clientaddr=10.0.8.170,local_lock=none,addr=10.0.9.8)

 

We will be applying these changes to all environments, so I will update if we 
hit the issue again afterwards.

> Journal lock evaluation fails when NFS is temporarily disconnected
> ------------------------------------------------------------------
>
>                 Key: ARTEMIS-3030
>                 URL: https://issues.apache.org/jira/browse/ARTEMIS-3030
>             Project: ActiveMQ Artemis
>          Issue Type: Bug
>          Components: Broker
>    Affects Versions: 2.16.0
>            Reporter: Apache Dev
>            Assignee: Francesco Nigro
>            Priority: Blocker
>
> Same scenario as ARTEMIS-2421.
> If the network between the Live Broker (B1) and the NFS server is 
> disconnected (for example by rejecting its TCP packets with iptables), this 
> happens after the lock lease timeout:
>  * Backup server (B2) becomes Live
>  * When NFS connectivity of B1 is restored, B1 remains Live
> So both brokers are live.
> The issue seems to be caused by {{java.nio.channels.FileLock#isValid}}, used in 
> {{org.apache.activemq.artemis.core.server.impl.FileLockNodeManager#isLiveLockLost}},
> because it always returns true, even if the lock was lost in the meantime 
> and taken by B2.
> Do you suggest using specific mount options for NFS?
> Or should the lock evaluation be replaced with a more reliable mechanism? We 
> notice that {{FileLock#isValid}} returns a cached value (true) even 
> when NFS connectivity is down, so it would be better to use a validation 
> mechanism that forces a query to the NFS server.
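
The behaviour described in the issue is consistent with the documented contract of FileLock#isValid: according to the JDK Javadoc, a lock object remains valid until it is released or the associated file channel is closed, whichever comes first. The following is a minimal sketch, not taken from Artemis, that shows this on a local temporary file. The class and method names (FileLockValidityDemo, observeOnTempFile) are hypothetical, and the sketch does not reproduce the NFS disconnection itself.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical demo class, not part of Artemis.
class FileLockValidityDemo {

    /**
     * Returns three observations of FileLock#isValid:
     * [0] while the lock is held, [1] after release(), [2] after the channel is closed.
     */
    static boolean[] observeOnTempFile() {
        try {
            Path file = Files.createTempFile("lock-demo", ".lock");
            try (FileChannel channel = FileChannel.open(file, StandardOpenOption.WRITE)) {
                FileLock first = channel.lock();
                boolean whileHeld = first.isValid();

                first.release();
                boolean afterRelease = first.isValid();

                FileLock second = channel.lock();
                channel.close(); // closing the channel invalidates the locks held through it
                boolean afterClose = second.isValid();

                return new boolean[] {whileHeld, afterRelease, afterClose};
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] seen = observeOnTempFile();
        System.out.println("while held:          " + seen[0]);
        System.out.println("after release():     " + seen[1]);
        System.out.println("after channel close: " + seen[2]);
    }
}
```

Since the validity flag only changes on release or channel close, it cannot by itself tell B1 that the NFS server has meanwhile granted the lock to B2, which matches the reporter's request for a check that actually queries the server.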


