[jira] [Commented] (ARTEMIS-3030) Journal lock evaluation fails when NFS is temporarily disconnected

Justin Bertram (Jira) Fri, 18 Dec 2020 09:27:36 -0800


    [ 
https://issues.apache.org/jira/browse/ARTEMIS-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17251906#comment-17251906
 ]


Justin Bertram commented on ARTEMIS-3030:
-----------------------------------------

I think a lot of your problems may come down to your NFS mount options. The JVM 
and therefore the broker are both at the mercy of the OS and the filesystem to 
report accurate data. NFS behaves differently based on the mount options you use. 
This is what I would recommend for the shared-store use-case:

* *timeo=50* - NFS timeout of 5 seconds (the value is given in tenths of a 
second)
* *retrans=1* - allows only one retry
* *soft* - soft mounting the NFS share disables the retry-forever logic, 
allowing NFS errors to surface in the application stack once the above timeout 
and retry are exhausted
* *noac* - turns off caching of file attributes and also enforces synchronous 
writes to the NFS share. This also reduces the time it takes for NFS errors to 
surface.

Since the broker relies on a fast and responsive filesystem, the goal is to make 
NFS fail quickly so that the broker can take the proper action. Are you using 
these mount options already or something else?
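
For comparing an existing mount against the options listed above, here is a minimal sketch in Java. It assumes you paste in the comma-separated option string you pass to mount (or keep in /etc/fstab). The class name, method name and the example string are hypothetical and not part of the broker.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class NfsMountOptionCheck {

    // The four options recommended in the comment above for the shared-store use-case.
    static final List<String> RECOMMENDED =
            Arrays.asList("soft", "timeo=50", "retrans=1", "noac");

    // Returns the recommended options that do not appear in the given
    // comma-separated mount option string (exact string match per option).
    static List<String> missingOptions(String mountOptions) {
        Set<String> present = new HashSet<>();
        for (String option : mountOptions.split(",")) {
            present.add(option.trim());
        }
        List<String> missing = new ArrayList<>();
        for (String option : RECOMMENDED) {
            if (!present.contains(option)) {
                missing.add(option);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Hypothetical option string, used only for illustration.
        String example = "rw,hard,timeo=600,retrans=2";
        System.out.println("Missing: " + missingOptions(example));
    }
}
```

The comparison is an exact string match, so a mount that uses a different timeout (for example timeo=600) is reported as missing timeo=50.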

> Journal lock evaluation fails when NFS is temporarily disconnected
> ------------------------------------------------------------------
>
>                 Key: ARTEMIS-3030
>                 URL: https://issues.apache.org/jira/browse/ARTEMIS-3030
>             Project: ActiveMQ Artemis
>          Issue Type: Bug
>          Components: Broker
>    Affects Versions: 2.16.0
>            Reporter: Apache Dev
>            Priority: Blocker
>
> Same scenario as ARTEMIS-2421.
> If the network between the Live Broker (B1) and the NFS Server is disconnected 
> (for example by rejecting its TCP packets with iptables), after the lock lease 
> timeout this happens:
>  * Backup server (B2) becomes Live
>  * When NFS connectivity of B1 is restored, B1 remains Live
> So both brokers are live.
> The issue seems to be caused by {{java.nio.channels.FileLock#isValid}} used in 
> {{org.apache.activemq.artemis.core.server.impl.FileLockNodeManager#isLiveLockLost}},
>  because it always returns true, even if in the meantime the lock was 
> lost and taken by B2.
> Do you suggest using specific mount options for NFS?
> Or should the lock evaluation be replaced with a more reliable mechanism? We 
> notice that {{FileLock#isValid}} returns a cached value (true), even 
> when NFS connectivity is down, so it would be better to use a validation 
> mechanism that forces querying the NFS server.
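
The observation in the report is consistent with the documented contract of java.nio.channels.FileLock: a lock object stays valid until it is released or its channel is closed, so isValid reflects only those local events. A minimal sketch on a local temporary file (not on NFS, and not the broker's code; the class and method names are hypothetical) shows that isValid changes only when the lock is released locally:

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class FileLockValidityDemo {

    // Returns { isValid while the lock is held, isValid after release() }.
    static boolean[] observeValidity(Path file) throws IOException {
        // An exclusive lock requires a channel opened for writing.
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.WRITE)) {
            FileLock lock = channel.lock();
            boolean whileHeld = lock.isValid();
            lock.release();
            boolean afterRelease = lock.isValid();
            return new boolean[] {whileHeld, afterRelease};
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("lock-demo", ".lock");
        try {
            boolean[] result = observeValidity(file);
            // prints "while held: true, after release: false"
            System.out.println("while held: " + result[0] + ", after release: " + result[1]);
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

Nothing in this sequence involves a round trip to a file server, which is why a check based on isValid alone cannot notice a lock that was lost on the NFS side.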



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
