[ https://issues.apache.org/jira/browse/ARTEMIS-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16678252#comment-16678252 ]

ASF GitHub Bot commented on ARTEMIS-2069:
-----------------------------------------

Github user franz1981 commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/2287#discussion_r231513243

    --- Diff: artemis-server/src/main/java/org/apache/activemq/artemis/core/server/impl/FileLockNodeManager.java ---
    @@ -299,36 +301,57 @@ protected FileLock tryLock(final long lockPos) throws IOException {
        protected FileLock lock(final long lockPosition) throws Exception {
           long start = System.currentTimeMillis();
    +      boolean isRecurringFailure = false;

           while (!interrupted) {
    -         FileLock lock = tryLock(lockPosition);
    -
    -         if (lock == null) {
    -            try {
    -               Thread.sleep(500);
    -            } catch (InterruptedException e) {
    -               return null;
    -            }
    -
    -            if (lockAcquisitionTimeout != -1 && (System.currentTimeMillis() - start) > lockAcquisitionTimeout) {
    -               throw new Exception("timed out waiting for lock");
    +         try {
    +            FileLock lock = tryLock(lockPosition);
    +            isRecurringFailure = false;
    +
    +            if (lock == null) {
    +               logger.debug("lock is null");
    +               try {
    +                  Thread.sleep(500);
    +               } catch (InterruptedException e) {
    +                  return null;
    +               }
    +
    +               if (lockAcquisitionTimeout != -1 && (System.currentTimeMillis() - start) > lockAcquisitionTimeout) {
    +                  throw new Exception("timed out waiting for lock");
    +               }
    +            } else {
    +               return lock;
                 }
    -         } else {
    -            return lock;
    +         } catch (IOException e) {
    +            // IOException during trylock() may be a temporary issue, e.g. NFS volume not being accessible
    +            logger.log(isRecurringFailure ? Logger.Level.DEBUG : Logger.Level.WARN,
    +                    "Failure when accessing a lock file", e);
    +            isRecurringFailure = true;
    +            Thread.sleep(LOCK_ACCESS_FAILURE_WAIT_TIME);
              }
           }

           // todo this is here because sometimes channel.lock throws a resource deadlock exception but trylock works,
           // need to investigate further and review
    -      FileLock lock;
    +      FileLock lock = null;
    --- End diff --

    Same thing as the comment above.

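For readers who want to try the retry pattern from the diff above in isolation, here is a minimal, self-contained sketch. It is not the Artemis implementation: LockRetrySketch, LockAttempt, lockWithRetry and the waitMillis parameter are hypothetical names introduced for illustration, the log is a plain list of strings instead of a real logger, and waitMillis stands in for both the 500 ms sleep and LOCK_ACCESS_FAILURE_WAIT_TIME (whose value is not shown in the diff).

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class LockRetrySketch {

    /** Hypothetical stand-in for a tryLock call: returns null when the lock is held elsewhere. */
    public interface LockAttempt<T> {
        T tryLock() throws IOException;
    }

    /**
     * Keeps trying until a lock is obtained, the timeout elapses, or the thread is interrupted.
     * A timeoutMillis of -1 means "wait indefinitely", as in the diff.
     * The first IOException in a run is recorded as WARN, repeats as DEBUG.
     */
    public static <T> T lockWithRetry(LockAttempt<T> attempt, long timeoutMillis,
                                      long waitMillis, List<String> log) throws Exception {
        long start = System.currentTimeMillis();
        boolean recurringFailure = false;

        while (!Thread.currentThread().isInterrupted()) {
            try {
                T lock = attempt.tryLock();
                recurringFailure = false; // storage reachable again: next failure is a WARN

                if (lock != null) {
                    return lock;
                }
                Thread.sleep(waitMillis);
                if (timeoutMillis != -1 && (System.currentTimeMillis() - start) > timeoutMillis) {
                    throw new Exception("timed out waiting for lock");
                }
            } catch (IOException e) {
                // Possibly temporary (e.g. an inaccessible NFS volume): record it and retry.
                log.add((recurringFailure ? "DEBUG" : "WARN")
                        + ": Failure when accessing a lock file: " + e.getMessage());
                recurringFailure = true;
                Thread.sleep(waitMillis);
            }
        }
        return null; // interrupted before a lock was obtained
    }

    public static void main(String[] args) throws Exception {
        List<String> log = new ArrayList<>();
        int[] calls = {0};

        // Simulated storage: two I/O failures, one "lock held elsewhere", then success.
        String lock = lockWithRetry(() -> {
            int n = calls[0]++;
            if (n < 2) {
                throw new IOException("Input/output error");
            }
            return n == 2 ? null : "LOCK";
        }, -1, 10, log);

        log.forEach(System.out::println);
        System.out.println("acquired: " + lock);
    }
}
```

As in the diff, the timeout is only checked on the "lock is null" path, so in this sketch a storage outage alone never causes a timeout.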
> Backup doesn't activate after shared store is reconnected
> ---------------------------------------------------------
>
>                 Key: ARTEMIS-2069
>                 URL: https://issues.apache.org/jira/browse/ARTEMIS-2069
>             Project: ActiveMQ Artemis
>          Issue Type: Bug
>    Affects Versions: 2.6.2
>            Reporter: Tomas Hofman
>            Priority: Major
>
> *Scenario*
> # Start a live/backup server pair in a dedicated topology with shared-store HA, with the journal located on NFS
> # NFS mounted on the backup server fails
> # Reconnect NFS on the backup server
> # Try to shut down the live EAP server
> # Backup doesn't activate
>
> *What happens*
> The backup waits for the live server to fail by checking its file lock. If the
> connection to shared storage fails, the backup logs the following error:
>
> 05:50:57,896 ERROR [org.apache.activemq.artemis.core.server] (AMQ119000: Activation for server ActiveMQServerImpl::serverUUID=836c9b1e-f067-11e7-8763-001b21862475) AMQ224000: Failure in initialisation: java.io.IOException: Input/output error
>     at sun.nio.ch.FileDispatcherImpl.lock0(Native Method) [rt.jar:1.8.0_151]
>     at sun.nio.ch.FileDispatcherImpl.lock(FileDispatcherImpl.java:90) [rt.jar:1.8.0_151]
>     at sun.nio.ch.FileChannelImpl.tryLock(FileChannelImpl.java:1115) [rt.jar:1.8.0_151]
>     at org.apache.activemq.artemis.core.server.impl.FileLockNodeManager.tryLock(FileLockNodeManager.java:299) [artemis-server-1.5.5.008-redhat-1.jar:1.5.5.008-redhat-1]
>     at org.apache.activemq.artemis.core.server.impl.FileLockNodeManager.lock(FileLockNodeManager.java:316) [artemis-server-1.5.5.008-redhat-1.jar:1.5.5.008-redhat-1]
>     at org.apache.activemq.artemis.core.server.impl.FileLockNodeManager.awaitLiveNode(FileLockNodeManager.java:127) [artemis-server-1.5.5.008-redhat-1.jar:1.5.5.008-redhat-1]
>     at org.apache.activemq.artemis.core.server.impl.SharedStoreBackupActivation.run(SharedStoreBackupActivation.java:77) [artemis-server-1.5.5.008-redhat-1.jar:1.5.5.008-redhat-1]
>     at org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$ActivationThread.run(ActiveMQServerImpl.java:2496) [artemis-server-1.5.5.008-redhat-1.jar:1.5.5.008-redhat-1]
>
> The exception is caught in {{SharedStoreBackupActivation.run}} and terminates
> the backup activation process.
> If NFS is reconnected later, the backup server does not resume the activation
> process and no longer waits for the live server to fail. If the live server
> then fails, the backup doesn't activate, even though it has a connection to
> shared storage.
> The backup should keep retrying the live-lock check even while the storage is
> unavailable. It should log warning/error messages that the storage is
> unavailable, but it should not terminate the activation process. This would
> allow the backup to resume its duties once the storage is reconnected.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)