[jira] [Commented] (ARTEMIS-3030) Journal lock evaluation fails when NFS is temporarily disconnected

Justin Bertram (Jira) Tue, 05 Jan 2021 14:27:41 -0800


    [ https://issues.apache.org/jira/browse/ARTEMIS-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17259199#comment-17259199 ]


Justin Bertram commented on ARTEMIS-3030:
-----------------------------------------

bq. However, our tests confirmed that FileLock#isValid is not reliable: when the
connection to NFS is interrupted, it always returns true. This also happens
during NFS disconnection when hard mounting is used, even though a hard mount
should instead block calls while NFS is unreachable.

It appears [the code|https://github.com/apache/activemq-artemis/blob/master/artemis-server/src/main/java/org/apache/activemq/artemis/core/server/impl/FileLockNodeManager.java#L507]
already accounts for this: it performs two different checks, along with this
comment:

{code:java}
// Java always thinks the lock is still valid even when there is no filesystem
// so we do another check
{code}
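That comment matches the documented contract of {{FileLock#isValid}}: per its javadoc, a lock object remains valid until it is released or its channel is closed, whichever comes first. The method therefore reports only local JVM state and never consults the file system. A minimal standalone demonstration, assuming a local temporary file in place of the NFS-mounted journal directory (the class and method names are hypothetical):

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class FileLockValidityDemo {

    /** Hypothetical helper: returns {isValid while held, isValid after release}. */
    static boolean[] validityBeforeAndAfterRelease(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            FileLock lock = channel.lock();
            // isValid() only tracks whether this JVM released the lock or closed the
            // channel; it does not contact the file system or an NFS server.
            boolean whileHeld = lock.isValid();
            lock.release();
            boolean afterRelease = lock.isValid();
            return new boolean[] {whileHeld, afterRelease};
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("lock-demo", ".lock");
        try {
            boolean[] result = validityBeforeAndAfterRelease(file);
            System.out.println("while held: " + result[0] + ", after release: " + result[1]);
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

Running it prints "while held: true, after release: false". Nothing in between those two calls would change the result if the underlying storage became unreachable.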

The other check involves reading the "state" from the shared lock file. If
reading the state fails, the lock is considered lost. Given your report, it
appears that reading the state succeeds, which is puzzling since the NFS
mount is disconnected. I'm not sure whether caching plays a role here. Could you
provide your full NFS mount options?
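The two-step check described above can be approximated as follows. This is a simplified, hypothetical sketch, not the actual {{FileLockNodeManager}} code (see the linked source for that): the class and method names are made up, and treating byte 0 of the file as the state byte is an assumption made for illustration.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class TwoStepLockCheck {

    /**
     * Hypothetical "lock lost" check: the lock counts as lost if FileLock#isValid
     * reports false OR the (assumed) state byte cannot be read from the file.
     */
    static boolean isLockLost(FileLock lock, FileChannel channel) {
        if (!lock.isValid()) {
            return true; // released or channel closed by this JVM
        }
        try {
            ByteBuffer state = ByteBuffer.allocate(1);
            // Positional read of the assumed state byte at offset 0.
            int read = channel.read(state, 0);
            return read != 1;
        } catch (IOException e) {
            // Any I/O error while reading the state is treated as losing the lock.
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("server", ".lock");
        try (FileChannel channel = FileChannel.open(file,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            channel.write(ByteBuffer.wrap(new byte[] {'L'}), 0); // hypothetical state value
            FileLock lock = channel.lock();
            System.out.println("lost while healthy: " + isLockLost(lock, channel));
            lock.release();
            System.out.println("lost after release: " + isLockLost(lock, channel));
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

The second check only helps if the read actually fails. If an NFS client were to answer that read from its cache while the server is unreachable (which is only a possibility raised above), the read would succeed and the lock would not be reported as lost.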

Also, for what it's worth, the nodes _already_ write their UUID to the 
{{server.lock}} file.

> Journal lock evaluation fails when NFS is temporarily disconnected
> ------------------------------------------------------------------
>
>                 Key: ARTEMIS-3030
>                 URL: https://issues.apache.org/jira/browse/ARTEMIS-3030
>             Project: ActiveMQ Artemis
>          Issue Type: Bug
>          Components: Broker
>    Affects Versions: 2.16.0
>            Reporter: Apache Dev
>            Priority: Blocker
>
> Same scenario as ARTEMIS-2421.
> If the network between the live broker (B1) and the NFS server is disconnected
> (for example, by rejecting its TCP packets with iptables), the following
> happens after the lock lease timeout:
>  * The backup server (B2) becomes live
>  * When NFS connectivity of B1 is restored, B1 remains live
> So both brokers are live.
> The issue seems to be caused by {{java.nio.channels.FileLock#isValid}} used in
> {{org.apache.activemq.artemis.core.server.impl.FileLockNodeManager#isLiveLockLost}},
> because it always returns true, even if in the meantime the lock was
> lost and taken by B2.
> Do you suggest using specific mount options for NFS?
> Or should the lock evaluation be replaced with a more reliable mechanism? We
> noticed that {{FileLock#isValid}} returns a cached value (true) even
> when NFS connectivity is down, so it would be better to use a validation
> mechanism that forces a query to the NFS server.
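The suggested alternative, a check that forces a round trip to the storage instead of trusting local state, could be sketched roughly as follows. This is not Artemis code: {{StorageProbe}}, the separate probe file, and the timeout are all hypothetical, and whether {{FileChannel#force(true)}} actually reaches the NFS server depends on the client and its mount options. The write runs on a separate thread with a timeout because, as the report notes, a hard mount should block calls while the server is unreachable.

```java
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class StorageProbe {

    private final ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "storage-probe");
        t.setDaemon(true); // a probe stuck on a hard mount must not keep the JVM alive
        return t;
    });

    /**
     * Writes a timestamp to a (hypothetical) probe file and forces it to storage.
     * Returns false if the write fails or does not finish within timeoutMillis.
     */
    boolean probe(Path probeFile, long timeoutMillis) {
        Future<?> result = executor.submit(() -> {
            try (FileChannel channel = FileChannel.open(probeFile,
                    StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
                ByteBuffer buf = ByteBuffer.allocate(Long.BYTES);
                buf.putLong(System.currentTimeMillis()).flip();
                channel.write(buf, 0);
                channel.force(true); // ask the OS to flush data and metadata
            }
            return null;
        });
        try {
            result.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            result.cancel(true); // may be ignored if the thread is stuck in the kernel
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } catch (ExecutionException e) {
            return false; // opening, writing, or forcing the file threw an exception
        }
    }

    void shutdown() {
        executor.shutdownNow();
    }

    public static void main(String[] args) throws Exception {
        Path probeFile = Files.createTempFile("nfs-probe", ".bin");
        StorageProbe probe = new StorageProbe();
        try {
            System.out.println("storage reachable: " + probe.probe(probeFile, 2_000));
        } finally {
            probe.shutdown();
            Files.deleteIfExists(probeFile);
        }
    }
}
```

A lock monitor could, for instance, treat a false result as a lost lock. One caveat of this design: a thread blocked in uninterruptible I/O on a hard mount may not respond to {{cancel(true)}}, so the single probe thread can stay stuck; later probes then queue behind it and also time out, which still reports the storage as unreachable.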



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
