[jira] [Commented] (ARTEMIS-3030) Journal lock evaluation fails when NFS is temporarily disconnected

Apache Dev (Jira) Tue, 22 Dec 2020 07:10:15 -0800


    [ 
https://issues.apache.org/jira/browse/ARTEMIS-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17253557#comment-17253557
 ]


Apache Dev commented on ARTEMIS-3030:
-------------------------------------

Thanks [~jbertram],
We tested the suggested mount options with both {{soft}} and {{hard}} 
mounting.
However, our tests confirmed that {{FileLock#isValid}} is not reliable: while 
the connection to NFS is interrupted, it always returns {{true}}. This happens 
even with {{hard}} mounting, which should instead block calls while NFS is 
unreachable.
Note also that, with {{soft}} mounting, the broker happens to detect an I/O 
error (see below) while running the FileStoreMonitor, and shuts down. As a 
side effect, this prevents the duplicate active broker problem once the NFS 
connection is restored. However, this should not be considered the right way 
to detect lost locks.
{code}
[12/9/20 16:34:50:071 CET] 00000081 org.apache.activemq.artemis.core.server W AMQ222010: Critical IO Error, shutting down the server. file=NULL, message=IO Error while calculating disk usage
java.nio.file.FileSystemException: /opt/shared/activemq-data/journal: Input/output error
    at sun.nio.fs.UnixException.translateToIOException(UnixException.java:103)
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:114)
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:119)
    at sun.nio.fs.UnixFileStore.readAttributes(UnixFileStore.java:123)
    at sun.nio.fs.UnixFileStore.getUsableSpace(UnixFileStore.java:136)
    at org.apache.activemq.artemis.core.server.files.FileStoreMonitor.tick(FileStoreMonitor.java:104)
    at org.apache.activemq.artemis.core.server.files.FileStoreMonitor.run(FileStoreMonitor.java:93)
    at org.apache.activemq.artemis.core.server.ActiveMQScheduledComponent.runForExecutor(ActiveMQScheduledComponent.java:313)
{code}
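
The {{FileLock#isValid}} behaviour described in this comment is consistent 
with the documented contract of {{java.nio.channels.FileLock}}: a lock object 
stays valid until it is released or its channel is closed, so the method only 
reports local state. The following minimal sketch illustrates this on a local 
temporary file. It does not reproduce the NFS outage, and the class name 
{{IsValidDemo}} is made up for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public final class IsValidDemo {

    // Returns {valid while held, valid after release} for a lock on a temp file.
    static boolean[] observe() {
        try {
            Path file = Files.createTempFile("lock-demo", ".lock");
            try (FileChannel channel = FileChannel.open(file, StandardOpenOption.WRITE)) {
                FileLock lock = channel.lock();
                // No I/O against the file is needed to answer this call:
                // the flag only changes on release() or channel close.
                boolean whileHeld = lock.isValid();
                lock.release();
                boolean afterRelease = lock.isValid();
                return new boolean[] {whileHeld, afterRelease};
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] result = observe();
        System.out.println("valid while held:    " + result[0]);
        System.out.println("valid after release: " + result[1]);
    }
}
```

Since nothing in the JVM releases the lock or closes the channel when the NFS 
server becomes unreachable, the flag has no reason to change during the 
outage.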

> Journal lock evaluation fails when NFS is temporarily disconnected
> ------------------------------------------------------------------
>
>                 Key: ARTEMIS-3030
>                 URL: https://issues.apache.org/jira/browse/ARTEMIS-3030
>             Project: ActiveMQ Artemis
>          Issue Type: Bug
>          Components: Broker
>    Affects Versions: 2.16.0
>            Reporter: Apache Dev
>            Priority: Blocker
>
> Same scenario as ARTEMIS-2421.
> If the network between the Live Broker (B1) and the NFS Server is 
> disconnected (for example by rejecting its TCP packets with iptables), this 
> happens after the lock lease timeout:
>  * Backup server (B2) becomes Live
>  * When NFS connectivity of B1 is restored, B1 remains Live
> So both brokers are live.
> The issue seems to be caused by {{java.nio.channels.FileLock#isValid}} used in 
> {{org.apache.activemq.artemis.core.server.impl.FileLockNodeManager#isLiveLockLost}},
>  because it always returns true, even if the lock has meanwhile been lost and 
> taken by B2.
> Do you suggest using specific mount options for NFS?
> Or should the lock evaluation be replaced with a more reliable mechanism? We 
> notice that {{FileLock#isValid}} returns a cached value (true) even when NFS 
> connectivity is down, so it would be better to use a validation mechanism 
> that forces a query to the NFS server.
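
One possible shape for a validation that forces I/O against the shared store 
is sketched below. It is only an illustration under stated assumptions, not 
the Artemis implementation: the class {{LockProbe}}, the method 
{{isLockUsable}}, and the {{probePosition}} and {{timeoutMillis}} parameters 
are all hypothetical. The probe writes one byte through the lock's channel and 
calls {{FileChannel#force}}, bounded by a timeout so that a blocked {{hard}} 
mount is reported as a failure instead of hanging the caller.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public final class LockProbe {

    // Daemon thread so a probe stuck on an unreachable mount cannot keep the JVM alive.
    private static final ExecutorService PROBE_EXECUTOR =
        Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "lock-probe");
            t.setDaemon(true);
            return t;
        });

    // Hypothetical check: true only if the lock is locally valid AND a forced
    // one-byte write at probePosition completes within timeoutMillis.
    // Assumption: probePosition is a byte the caller does not otherwise use.
    static boolean isLockUsable(FileChannel channel, FileLock lock,
                                long probePosition, long timeoutMillis) {
        if (!lock.isValid()) {
            return false;
        }
        Future<Boolean> probe = PROBE_EXECUTOR.submit(() -> {
            channel.write(ByteBuffer.wrap(new byte[] {1}), probePosition);
            channel.force(true); // push data and metadata to the storage device
            return true;
        });
        try {
            return probe.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Not interrupting: an interrupt would close the FileChannel.
            // While the write stays blocked, later probes queue up and time out too.
            probe.cancel(false);
            return false;
        } catch (ExecutionException e) {
            return false; // I/O error, e.g. from a soft mount
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    // Usage on a local temp file: returns {usable while held, usable after release}.
    static boolean[] demo() {
        try {
            Path file = Files.createTempFile("probe-demo", ".lock");
            try (FileChannel channel = FileChannel.open(file, StandardOpenOption.WRITE)) {
                FileLock lock = channel.lock();
                boolean whileHeld = isLockUsable(channel, lock, 0L, 2000L);
                lock.release();
                boolean afterRelease = isLockUsable(channel, lock, 0L, 2000L);
                return new boolean[] {whileHeld, afterRelease};
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] result = demo();
        System.out.println("usable while held:    " + result[0]);
        System.out.println("usable after release: " + result[1]);
    }
}
```

A probe like this only detects that the storage is unreachable or failing 
while the check runs. A successful write after connectivity is restored does 
not prove that the NFS server still considers the lock held by this broker, so 
it would have to run more often than the lock lease timeout to be useful.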



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
