[jira] [Commented] (ARTEMIS-3030) Journal lock evaluation fails when NFS is temporarily disconnected

Apache Dev (Jira) Tue, 15 Dec 2020 10:30:07 -0800


    [ 
https://issues.apache.org/jira/browse/ARTEMIS-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17249853#comment-17249853
 ]


Apache Dev commented on ARTEMIS-3030:
-------------------------------------

Additional analysis done:
 * {{org.apache.activemq.artemis.core.server.impl.FileLockNodeManager#startLockMonitoring}}
 is executed only for the Live lock, when the Master broker acquires it during 
its startup. The same monitoring should also be done for a backup broker 
which has acquired the Live lock after the Master has failed.
 * The same monitoring mechanism should be implemented for the Backup lock too. 
In the scenario described in the issue, if more than one backup broker exists, 
an NFS connection interruption can also cause two backup brokers to both 
acquire the Backup lock.
 * {{java.nio.channels.FileLock#isValid}} proved unreliable with NFS locking in 
every scenario we tested, with iptables temporarily rejecting NFS packets from 
the broker to the NFS servers.
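
A possible explanation for the last point: the Javadoc of {{FileLock}} states that a lock object remains valid until it is released or its channel is closed, whichever comes first, so {{isValid}} describes local state only. The sketch below is illustrative only ({{IsValidSketch}} and {{observe}} are made-up names, not Artemis code) and shows that the result changes only after a local release:

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical demo class, not part of Artemis.
final class IsValidSketch {

    // Returns { isValid() while the lock is held, isValid() after release }.
    static boolean[] observe(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            FileLock lock = ch.lock();          // exclusive lock on the whole file
            boolean whileHeld = lock.isValid(); // true: neither released nor closed
            lock.release();
            boolean afterRelease = lock.isValid(); // false: released locally
            return new boolean[] {whileHeld, afterRelease};
        }
    }
}
```

Nothing in this contract covers a lock that the NFS server has dropped and granted to another client, which would be consistent with the value staying true during the tests.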

One possible solution to the lock validation problem is to use additional 
files containing the ID of the broker which holds the lock.

The following procedure must be applied to both the Live lock and the Backup 
lock. For example, in the case of the Live lock:
 # when the broker acquires the lock, it writes an additional file 
"lock.live.uuid" (this has to be done while holding a write lock on that file) 
and sets its content to a unique ID of the broker
 # the broker then starts a thread which checks that file every x seconds (e.g. 2)
 # the file "lock.live.uuid" also needs to be read under a write lock each time
 # when the broker detects that the content has changed, it means that the lock 
has been stolen, and the broker stops (or restarts) in order to avoid having 
two active brokers
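
The steps above could be sketched roughly as follows. This is only an illustration under assumptions, not a proposed patch: the class {{LockOwnerFile}}, its method names and the {{onLockLost}} callback are hypothetical, and treating an I/O error as a lost lock is a choice made here, not something decided in this issue.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of Artemis.
final class LockOwnerFile {
    private final Path file;       // e.g. the "lock.live.uuid" file
    private final String brokerId; // unique ID of this broker

    LockOwnerFile(Path file, String brokerId) {
        this.file = file;
        this.brokerId = brokerId;
    }

    // Step 1: after acquiring the Live lock, record this broker as the owner.
    void claim() throws IOException {
        try (FileChannel ch = FileChannel.open(file,
                 StandardOpenOption.CREATE, StandardOpenOption.WRITE);
             FileLock held = ch.lock()) { // exclusive (write) lock
            ch.truncate(0);
            ByteBuffer buf = ByteBuffer.wrap(brokerId.getBytes(StandardCharsets.UTF_8));
            while (buf.hasRemaining()) {
                ch.write(buf);
            }
            ch.force(true); // flush content and metadata
        }
    }

    // Step 3: read the owner ID while holding an exclusive lock.
    // The channel is opened for writing too, since an exclusive lock requires it.
    String readOwner() throws IOException {
        try (FileChannel ch = FileChannel.open(file,
                 StandardOpenOption.READ, StandardOpenOption.WRITE);
             FileLock held = ch.lock()) {
            ByteBuffer buf = ByteBuffer.allocate((int) ch.size());
            while (buf.hasRemaining() && ch.read(buf) != -1) {
                // keep reading until the buffer is full or EOF is reached
            }
            buf.flip();
            return StandardCharsets.UTF_8.decode(buf).toString();
        }
    }

    boolean stillOwner() throws IOException {
        return brokerId.equals(readOwner());
    }

    // Steps 2 and 4: check the file periodically and report a stolen lock.
    // The caller is expected to shut the returned scheduler down.
    ScheduledExecutorService startMonitoring(long periodSeconds, Runnable onLockLost) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleWithFixedDelay(() -> {
            boolean owner;
            try {
                owner = stillOwner();
            } catch (IOException e) {
                owner = false; // assumption: an unreadable file counts as a lost lock
            }
            if (!owner) {
                onLockLost.run(); // e.g. stop or restart the broker
            }
        }, periodSeconds, periodSeconds, TimeUnit.SECONDS);
        return scheduler;
    }
}
```

For example, a broker could call {{claim()}} right after acquiring the Live lock and then {{startMonitoring(2, ...)}} with a callback that stops it. Whether an NFS outage should surface as an I/O error or as a blocked call depends on the mount options and would need testing.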

Using a file separate from the one implementing the Live/Backup lock, with 
content that has to be read and with write locks taken on it, seems to 
reliably force the NFS client to communicate with the NFS server instead of 
serving cached data.

> Journal lock evaluation fails when NFS is temporarily disconnected
> ------------------------------------------------------------------
>
>                 Key: ARTEMIS-3030
>                 URL: https://issues.apache.org/jira/browse/ARTEMIS-3030
>             Project: ActiveMQ Artemis
>          Issue Type: Bug
>          Components: Broker
>    Affects Versions: 2.16.0
>            Reporter: Apache Dev
>            Priority: Blocker
>
> Same scenario as ARTEMIS-2421.
> If the network between the Live Broker (B1) and the NFS Server is 
> disconnected (for example by rejecting its TCP packets with iptables), after 
> the lock lease timeout this happens:
>  * Backup server (B2) becomes Live
>  * When NFS connectivity of B1 is restored, B1 remains Live
> So both brokers are live.
> The issue seems to be caused by {{java.nio.channels.FileLock#isValid}}, used in 
> {{org.apache.activemq.artemis.core.server.impl.FileLockNodeManager#isLiveLockLost}},
>  because it always returns true, even if in the meantime the lock was 
> lost and taken by B2.
> Do you suggest using specific mount options for NFS?
> Or should the lock evaluation be replaced with a more reliable mechanism? We 
> notice that {{FileLock#isValid}} returns a cached value (true), even 
> when NFS connectivity is down, so it would be better to use a validation 
> mechanism that forces a query to the NFS server.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
