[
https://issues.apache.org/jira/browse/ARTEMIS-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16679558#comment-16679558
]
ASF GitHub Bot commented on ARTEMIS-2069:
-----------------------------------------
Github user TomasHofman commented on a diff in the pull request:
https://github.com/apache/activemq-artemis/pull/2287#discussion_r231830652
--- Diff:
artemis-server/src/main/java/org/apache/activemq/artemis/core/server/impl/FileLockNodeManager.java
---
@@ -299,36 +301,57 @@ protected FileLock tryLock(final long lockPos) throws IOException {
protected FileLock lock(final long lockPosition) throws Exception {
long start = System.currentTimeMillis();
+ boolean isRecurringFailure = false;
while (!interrupted) {
- FileLock lock = tryLock(lockPosition);
-
- if (lock == null) {
- try {
- Thread.sleep(500);
- } catch (InterruptedException e) {
- return null;
- }
-
- if (lockAcquisitionTimeout != -1 && (System.currentTimeMillis() - start) > lockAcquisitionTimeout) {
- throw new Exception("timed out waiting for lock");
+ try {
+ FileLock lock = tryLock(lockPosition);
+ isRecurringFailure = false;
+
+ if (lock == null) {
+ logger.debug("lock is null");
+ try {
+ Thread.sleep(500);
+ } catch (InterruptedException e) {
+ return null;
+ }
+
+ if (lockAcquisitionTimeout != -1 && (System.currentTimeMillis() - start) > lockAcquisitionTimeout) {
+ throw new Exception("timed out waiting for lock");
+ }
+ } else {
+ return lock;
}
- } else {
- return lock;
+ } catch (IOException e) {
+ // IOException during trylock() may be a temporary issue, e.g. NFS volume not being accessible
+ logger.log(isRecurringFailure ? Logger.Level.DEBUG : Logger.Level.WARN,
+ "Failure when accessing a lock file", e);
+ isRecurringFailure = true;
--- End diff --
Modified to exit if the timeout has already been reached, and to never sleep
longer than the time remaining until the timeout.
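
For illustration only, the behaviour described in this comment could look
roughly like the sketch below. It is not the code from the pull request:
BoundedLockRetry, LockAttempt, remainingMillis and the pauseMillis parameter
are hypothetical names, and the lock attempt is abstracted behind an
interface so the sketch runs without a real lock file. It assumes, as in the
diff above, that a timeout of -1 means "wait forever" and that an
IOException from the lock attempt may be transient.

```java
import java.io.IOException;

// Hypothetical sketch, not the actual FileLockNodeManager code.
final class BoundedLockRetry {

    /** Stand-in for a single non-blocking lock attempt; null means "held elsewhere". */
    @FunctionalInterface
    interface LockAttempt<T> {
        T tryLock() throws IOException;
    }

    /** Time left before the timeout; a timeout of -1 is treated as "no timeout". */
    static long remainingMillis(long elapsedMillis, long timeoutMillis) {
        return timeoutMillis == -1 ? Long.MAX_VALUE : timeoutMillis - elapsedMillis;
    }

    /** Retries until a lock is obtained, the timeout is reached, or the thread is interrupted. */
    static <T> T lock(LockAttempt<T> attempt, long timeoutMillis, long pauseMillis) throws Exception {
        long start = System.nanoTime();
        while (!Thread.currentThread().isInterrupted()) {
            try {
                T lock = attempt.tryLock();
                if (lock != null) {
                    return lock;
                }
            } catch (IOException e) {
                // Possibly transient (e.g. storage unreachable): fall through and retry.
            }

            long elapsed = (System.nanoTime() - start) / 1_000_000;
            long remaining = remainingMillis(elapsed, timeoutMillis);
            if (remaining <= 0) {
                // Exit immediately once the timeout has been reached.
                throw new Exception("timed out waiting for lock");
            }
            try {
                // Never sleep past the point at which the timeout would expire.
                Thread.sleep(Math.min(pauseMillis, remaining));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return null;
            }
        }
        return null;
    }
}
```

With this shape, a failed attempt and an IOException both go through the
same timeout check, so an unreachable volume cannot keep the loop running
past the configured timeout.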
> Backup doesn't activate after shared store is reconnected
> ---------------------------------------------------------
>
> Key: ARTEMIS-2069
> URL: https://issues.apache.org/jira/browse/ARTEMIS-2069
> Project: ActiveMQ Artemis
> Issue Type: Bug
> Affects Versions: 2.6.2
> Reporter: Tomas Hofman
> Priority: Major
>
> *Scenario*
> # Start a live/backup server pair in a dedicated topology with shared store
> HA, with the journal located on NFS
> # The NFS mount on the backup server fails
> # Reconnect NFS on the backup server
> # Try to shut down the live EAP server
> # The backup doesn't activate
> *What happens*
> The backup waits for the live server to fail by checking its file lock. If
> the connection to shared storage fails, the backup logs the following error.
>
> 05:50:57,896 ERROR [org.apache.activemq.artemis.core.server] (AMQ119000: Activation for server ActiveMQServerImpl::serverUUID=836c9b1e-f067-11e7-8763-001b21862475) AMQ224000: Failure in initialisation: java.io.IOException: Input/output error
>   at sun.nio.ch.FileDispatcherImpl.lock0(Native Method) [rt.jar:1.8.0_151]
>   at sun.nio.ch.FileDispatcherImpl.lock(FileDispatcherImpl.java:90) [rt.jar:1.8.0_151]
>   at sun.nio.ch.FileChannelImpl.tryLock(FileChannelImpl.java:1115) [rt.jar:1.8.0_151]
>   at org.apache.activemq.artemis.core.server.impl.FileLockNodeManager.tryLock(FileLockNodeManager.java:299) [artemis-server-1.5.5.008-redhat-1.jar:1.5.5.008-redhat-1]
>   at org.apache.activemq.artemis.core.server.impl.FileLockNodeManager.lock(FileLockNodeManager.java:316) [artemis-server-1.5.5.008-redhat-1.jar:1.5.5.008-redhat-1]
>   at org.apache.activemq.artemis.core.server.impl.FileLockNodeManager.awaitLiveNode(FileLockNodeManager.java:127) [artemis-server-1.5.5.008-redhat-1.jar:1.5.5.008-redhat-1]
>   at org.apache.activemq.artemis.core.server.impl.SharedStoreBackupActivation.run(SharedStoreBackupActivation.java:77) [artemis-server-1.5.5.008-redhat-1.jar:1.5.5.008-redhat-1]
>   at org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$ActivationThread.run(ActiveMQServerImpl.java:2496) [artemis-server-1.5.5.008-redhat-1.jar:1.5.5.008-redhat-1]
>
> The exception is caught in {{SharedStoreBackupActivation.run}} and
> terminates the backup activation process.
> If NFS is reconnected later, the backup server does not resume the
> activation process and no longer waits for the live server to fail. If the
> live server then fails, the backup doesn't activate, even though it has a
> connection to shared storage.
> The backup should keep retrying the live lock check even while the storage
> is unavailable. It should log warning/error messages that the storage is
> unavailable, but it should not terminate the activation process. This would
> allow the backup to resume its duties once the storage is reconnected.