[
https://issues.apache.org/jira/browse/ARTEMIS-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17249853#comment-17249853
]
Apache Dev commented on ARTEMIS-3030:
-------------------------------------
Additional analysis done:
*
{{org.apache.activemq.artemis.core.server.impl.FileLockNodeManager#startLockMonitoring}}
is executed only for the Live lock, and only when the Master broker acquires it
during startup. The same monitoring should also run for a backup broker
that has acquired the Live lock after the Master failed.
* The same monitoring mechanism should be implemented for the Backup lock too.
In the same scenario described in the issue, if more than one backup broker
exists, an NFS connection interruption can also cause two backup brokers to
both acquire the backup lock.
* {{java.nio.channels.FileLock#isValid}} is not reliable with NFS locking in
all scenarios we tested, with iptables temporarily rejecting NFS packets from
the broker to the NFS servers.
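A minimal sketch of why {{isValid}} cannot detect a lock lost on the server side. Per its Javadoc, a {{FileLock}} remains valid until it is released or its channel is closed, so the value only reflects local state. The class and method names ({{IsValidDemo}}, {{probe}}) are hypothetical and not part of Artemis:

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical demo class, not Artemis code.
final class IsValidDemo {

    /** Returns {isValid while the lock is held, isValid after a local release}. */
    static boolean[] probe(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            FileLock lock = ch.lock();            // exclusive lock on the whole file
            boolean whileHeld = lock.isValid();   // true: nothing released it locally
            lock.release();
            boolean afterRelease = lock.isValid(); // false: changed by a local action
            return new boolean[] {whileHeld, afterRelease};
        }
    }
}
```

The flag changes only on a local release or close, which is consistent with it staying true while the NFS server has already granted the lock to another broker.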
As a possible solution for the lock validation problem, additional files
containing the ID of the broker that holds the lock could be used.
The following procedure must be applied to both the Live lock and the Backup
lock. For example, in the case of the Live lock:
# when the broker acquires the lock, an additional file "lock.live.uuid" is
written (this has to be done while holding a write lock on that file) and its
content is set to a unique ID of the broker
# then the broker starts a thread which checks that file every x seconds (e.g. 2)
# file "lock.live.uuid" also needs to be read under a write lock each time
# when the broker detects that the content has changed, it means that the lock
has been stolen, and the broker stops (or restarts) in order to avoid having
two active brokers
Using a file (different from the one implementing the Live/Backup lock)
whose content has to be read, together with write locks, seems to reliably
force the NFS client to communicate with the NFS server, avoiding cached data.
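The procedure above could be sketched roughly as follows. This is only an illustration under assumptions, not the Artemis implementation: the class {{LockOwnerFile}} and its methods are hypothetical names, and the handling of I/O errors during a check is left open because the proposal does not cover it.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of Artemis.
final class LockOwnerFile {
    private final Path file;      // e.g. the "lock.live.uuid" file
    private final String ownerId; // unique ID of this broker

    LockOwnerFile(Path file, String ownerId) {
        this.file = file;
        this.ownerId = ownerId;
    }

    /** Step 1: after acquiring the Live lock, write our ID under an exclusive lock. */
    void claim() throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.READ, StandardOpenOption.WRITE);
             FileLock ignored = ch.lock()) {
            ch.truncate(0);
            ByteBuffer buf = ByteBuffer.wrap(ownerId.getBytes(StandardCharsets.UTF_8));
            while (buf.hasRemaining()) {
                ch.write(buf);
            }
            ch.force(true); // push the content to the storage device
        }
    }

    /** Step 3: read the ID, also under an exclusive (write) lock. */
    String readOwner() throws IOException {
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.READ, StandardOpenOption.WRITE);
             FileLock ignored = ch.lock()) {
            ByteBuffer buf = ByteBuffer.allocate((int) ch.size());
            while (buf.hasRemaining() && ch.read(buf) != -1) {
                // keep reading until the buffer is full or EOF is reached
            }
            return new String(buf.array(), 0, buf.position(), StandardCharsets.UTF_8);
        }
    }

    /** Step 4: the lock counts as stolen when the file no longer holds our ID. */
    boolean isStolen() throws IOException {
        return !ownerId.equals(readOwner());
    }

    /** Step 2: check the file every periodSeconds; onLockLost should stop or restart the broker. */
    ScheduledExecutorService startMonitoring(long periodSeconds, Runnable onLockLost) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleWithFixedDelay(() -> {
            try {
                if (isStolen()) {
                    onLockLost.run();
                }
            } catch (IOException e) {
                // Assumption: a failed check is skipped here; whether it should
                // count as a lost lock is a policy decision outside this sketch.
            }
        }, periodSeconds, periodSeconds, TimeUnit.SECONDS);
        return scheduler;
    }
}
```

A broker would call {{claim()}} right after acquiring the Live lock, then {{startMonitoring(2, ...)}}, and shut the returned scheduler down when it stops. The same class could be reused with a second file for the Backup lock.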
> Journal lock evaluation fails when NFS is temporarily disconnected
> ------------------------------------------------------------------
>
> Key: ARTEMIS-3030
> URL: https://issues.apache.org/jira/browse/ARTEMIS-3030
> Project: ActiveMQ Artemis
> Issue Type: Bug
> Components: Broker
> Affects Versions: 2.16.0
> Reporter: Apache Dev
> Priority: Blocker
>
> Same scenario as ARTEMIS-2421.
> If the network between the Live Broker (B1) and the NFS Server is disconnected
> (for example by rejecting its TCP packets with iptables), after the lock lease
> timeout this happens:
> * Backup server (B2) becomes Live
> * When NFS connectivity of B1 is restored, B1 remains Live
> So both brokers are live.
> The issue seems to be caused by {{java.nio.channels.FileLock#isValid}} used in
> {{org.apache.activemq.artemis.core.server.impl.FileLockNodeManager#isLiveLockLost}},
> because it always returns true, even if in the meantime the lock was
> lost and taken by B2.
> Do you suggest using specific mount options for NFS?
> Or should the lock evaluation be replaced with a more reliable mechanism? We
> noticed that {{FileLock#isValid}} returns a cached value (true), even
> when NFS connectivity is down, so it would be better to use a validation
> mechanism that forces querying the NFS server.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)