[GitHub] activemq-artemis pull request #2287: ARTEMIS-2069 Backup doesn't activate af...
Github user TomasHofman commented on a diff in the pull request: https://github.com/apache/activemq-artemis/pull/2287#discussion_r240294762 --- Diff: artemis-server/src/main/java/org/apache/activemq/artemis/core/server/impl/FileLockNodeManager.java --- @@ -49,6 +49,8 @@ private static final byte NOT_STARTED = 'N'; + private static final long LOCK_ACCESS_FAILURE_WAIT_TIME = 2000; --- End diff -- I added another test to verify that `lockAcquisitionTimeout` is honored. The only hard waiting time is 500ms on line 313, but that's the original behavior. I can make that honor `lockAcquisitionTimeout` as well if desired. ---
[GitHub] activemq-artemis pull request #2287: ARTEMIS-2069 Backup doesn't activate af...
Github user TomasHofman commented on a diff in the pull request: https://github.com/apache/activemq-artemis/pull/2287#discussion_r240293147 --- Diff: artemis-server/src/main/java/org/apache/activemq/artemis/core/server/impl/FileLockNodeManager.java --- @@ -49,6 +49,8 @@ private static final byte NOT_STARTED = 'N'; + private static final long LOCK_ACCESS_FAILURE_WAIT_TIME = 2000; --- End diff -- @clebertsuconic no, the `lockAcquisitionTimeout` is not ignored. If you check the line 337, actual waiting time is either `LOCK_ACCESS_FAILURE_WAIT_TIME` or time remaining until `lockAcquisitionTimeout` runs out, whichever is lower, so acquisition timeout set by user is honored. ---
[GitHub] activemq-artemis pull request #2287: ARTEMIS-2069 Backup doesn't activate af...
Github user clebertsuconic commented on a diff in the pull request: https://github.com/apache/activemq-artemis/pull/2287#discussion_r240232734 --- Diff: artemis-server/src/main/java/org/apache/activemq/artemis/core/server/impl/FileLockNodeManager.java --- @@ -49,6 +49,8 @@ private static final byte NOT_STARTED = 'N'; + private static final long LOCK_ACCESS_FAILURE_WAIT_TIME = 2000; --- End diff -- You are here effectively ignoring the acquire timeout where the user can configure it. I'm not sure this is correct. your test should play with a configured timeout and see if you get the expected behaviour. Adding this you are forcing your own timeout bypassing the configured one. ---
[GitHub] activemq-artemis pull request #2287: ARTEMIS-2069 Backup doesn't activate af...
Github user TomasHofman commented on a diff in the pull request: https://github.com/apache/activemq-artemis/pull/2287#discussion_r231878791 --- Diff: artemis-server/src/main/java/org/apache/activemq/artemis/core/server/impl/FileLockNodeManager.java --- @@ -299,44 +303,52 @@ protected FileLock tryLock(final long lockPos) throws IOException { protected FileLock lock(final long lockPosition) throws Exception { long start = System.currentTimeMillis(); + boolean isRecurringFailure = false; while (!interrupted) { - FileLock lock = tryLock(lockPosition); - - if (lock == null) { -try { - Thread.sleep(500); -} catch (InterruptedException e) { - return null; + try { +FileLock lock = tryLock(lockPosition); +isRecurringFailure = false; + +if (lock == null) { + try { + Thread.sleep(500); + } catch (InterruptedException e) { + return null; + } + + if (lockAcquisitionTimeout != -1 && (System.currentTimeMillis() - start) > lockAcquisitionTimeout) { + throw new Exception("timed out waiting for lock"); + } +} else { + return lock; } - -if (lockAcquisitionTimeout != -1 && (System.currentTimeMillis() - start) > lockAcquisitionTimeout) { - throw new Exception("timed out waiting for lock"); + } catch (IOException e) { +// IOException during trylock() may be a temporary issue, e.g. NFS volume not being accessible + +logger.log(isRecurringFailure ? Logger.Level.DEBUG : Logger.Level.WARN, +"Failure when accessing a lock file", e); +isRecurringFailure = true; + +long waitTime = LOCK_ACCESS_FAILURE_WAIT_TIME; +if (lockAcquisitionTimeout != -1) { + final long remainingTime = lockAcquisitionTimeout - (System.currentTimeMillis() - start); + if (remainingTime <= 0) { + throw new Exception("timed out waiting for lock"); + } + waitTime = Collections.min(Arrays.asList(waitTime, remainingTime)); --- End diff -- Good point, updated. ---
[GitHub] activemq-artemis pull request #2287: ARTEMIS-2069 Backup doesn't activate af...
Github user franz1981 commented on a diff in the pull request: https://github.com/apache/activemq-artemis/pull/2287#discussion_r231870842 --- Diff: artemis-server/src/main/java/org/apache/activemq/artemis/core/server/impl/FileLockNodeManager.java --- @@ -299,44 +303,52 @@ protected FileLock tryLock(final long lockPos) throws IOException { protected FileLock lock(final long lockPosition) throws Exception { long start = System.currentTimeMillis(); + boolean isRecurringFailure = false; while (!interrupted) { - FileLock lock = tryLock(lockPosition); - - if (lock == null) { -try { - Thread.sleep(500); -} catch (InterruptedException e) { - return null; + try { +FileLock lock = tryLock(lockPosition); +isRecurringFailure = false; + +if (lock == null) { + try { + Thread.sleep(500); + } catch (InterruptedException e) { + return null; + } + + if (lockAcquisitionTimeout != -1 && (System.currentTimeMillis() - start) > lockAcquisitionTimeout) { + throw new Exception("timed out waiting for lock"); + } +} else { + return lock; } - -if (lockAcquisitionTimeout != -1 && (System.currentTimeMillis() - start) > lockAcquisitionTimeout) { - throw new Exception("timed out waiting for lock"); + } catch (IOException e) { +// IOException during trylock() may be a temporary issue, e.g. NFS volume not being accessible + +logger.log(isRecurringFailure ? Logger.Level.DEBUG : Logger.Level.WARN, +"Failure when accessing a lock file", e); +isRecurringFailure = true; + +long waitTime = LOCK_ACCESS_FAILURE_WAIT_TIME; +if (lockAcquisitionTimeout != -1) { + final long remainingTime = lockAcquisitionTimeout - (System.currentTimeMillis() - start); + if (remainingTime <= 0) { + throw new Exception("timed out waiting for lock"); + } + waitTime = Collections.min(Arrays.asList(waitTime, remainingTime)); --- End diff -- `Math.min` will do the same as `Collections.min(Arrays.asList(waitTime, remainingTime));` ---
[GitHub] activemq-artemis pull request #2287: ARTEMIS-2069 Backup doesn't activate af...
Github user TomasHofman commented on a diff in the pull request: https://github.com/apache/activemq-artemis/pull/2287#discussion_r231830652 --- Diff: artemis-server/src/main/java/org/apache/activemq/artemis/core/server/impl/FileLockNodeManager.java --- @@ -299,36 +301,57 @@ protected FileLock tryLock(final long lockPos) throws IOException { protected FileLock lock(final long lockPosition) throws Exception { long start = System.currentTimeMillis(); + boolean isRecurringFailure = false; while (!interrupted) { - FileLock lock = tryLock(lockPosition); - - if (lock == null) { -try { - Thread.sleep(500); -} catch (InterruptedException e) { - return null; -} - -if (lockAcquisitionTimeout != -1 && (System.currentTimeMillis() - start) > lockAcquisitionTimeout) { - throw new Exception("timed out waiting for lock"); + try { +FileLock lock = tryLock(lockPosition); +isRecurringFailure = false; + +if (lock == null) { + logger.debug("lock is null"); + try { + Thread.sleep(500); + } catch (InterruptedException e) { + return null; + } + + if (lockAcquisitionTimeout != -1 && (System.currentTimeMillis() - start) > lockAcquisitionTimeout) { + throw new Exception("timed out waiting for lock"); + } +} else { + return lock; } - } else { -return lock; + } catch (IOException e) { +// IOException during trylock() may be a temporary issue, e.g. NFS volume not being accessible +logger.log(isRecurringFailure ? Logger.Level.DEBUG : Logger.Level.WARN, +"Failure when accessing a lock file", e); +isRecurringFailure = true; --- End diff -- Modified to exit if timeout already reached, and don't sleep longer then remaining time to timeout. ---
[GitHub] activemq-artemis pull request #2287: ARTEMIS-2069 Backup doesn't activate af...
Github user TomasHofman commented on a diff in the pull request: https://github.com/apache/activemq-artemis/pull/2287#discussion_r231822854 --- Diff: artemis-server/src/main/java/org/apache/activemq/artemis/core/server/impl/FileLockNodeManager.java --- @@ -299,36 +301,57 @@ protected FileLock tryLock(final long lockPos) throws IOException { protected FileLock lock(final long lockPosition) throws Exception { long start = System.currentTimeMillis(); + boolean isRecurringFailure = false; while (!interrupted) { - FileLock lock = tryLock(lockPosition); - - if (lock == null) { -try { - Thread.sleep(500); -} catch (InterruptedException e) { - return null; -} - -if (lockAcquisitionTimeout != -1 && (System.currentTimeMillis() - start) > lockAcquisitionTimeout) { - throw new Exception("timed out waiting for lock"); + try { +FileLock lock = tryLock(lockPosition); +isRecurringFailure = false; + +if (lock == null) { + logger.debug("lock is null"); + try { + Thread.sleep(500); + } catch (InterruptedException e) { + return null; + } + + if (lockAcquisitionTimeout != -1 && (System.currentTimeMillis() - start) > lockAcquisitionTimeout) { + throw new Exception("timed out waiting for lock"); + } +} else { + return lock; } - } else { -return lock; + } catch (IOException e) { +// IOException during trylock() may be a temporary issue, e.g. NFS volume not being accessible +logger.log(isRecurringFailure ? Logger.Level.DEBUG : Logger.Level.WARN, +"Failure when accessing a lock file", e); +isRecurringFailure = true; +Thread.sleep(LOCK_ACCESS_FAILURE_WAIT_TIME); } } // todo this is here because sometimes channel.lock throws a resource deadlock exception but trylock works, // need to investigate further and review - FileLock lock; + FileLock lock = null; --- End diff -- Now when I look at this again, this whole second loop in the _original code_ doesn't make sense - the only way execution could get here is when the ```interrupted``` flag was set to true, in which case we should exit immediately. The comment mentions "deadlock exception", but any exception in the first loop would terminate the method. I'm gonna remove this second loop altogether. ---
[GitHub] activemq-artemis pull request #2287: ARTEMIS-2069 Backup doesn't activate af...
Github user franz1981 commented on a diff in the pull request: https://github.com/apache/activemq-artemis/pull/2287#discussion_r231511486 --- Diff: artemis-server/src/main/java/org/apache/activemq/artemis/core/server/impl/FileLockNodeManager.java --- @@ -299,36 +301,57 @@ protected FileLock tryLock(final long lockPos) throws IOException { protected FileLock lock(final long lockPosition) throws Exception { long start = System.currentTimeMillis(); + boolean isRecurringFailure = false; while (!interrupted) { - FileLock lock = tryLock(lockPosition); - - if (lock == null) { -try { - Thread.sleep(500); -} catch (InterruptedException e) { - return null; -} - -if (lockAcquisitionTimeout != -1 && (System.currentTimeMillis() - start) > lockAcquisitionTimeout) { - throw new Exception("timed out waiting for lock"); + try { +FileLock lock = tryLock(lockPosition); +isRecurringFailure = false; + +if (lock == null) { + logger.debug("lock is null"); + try { + Thread.sleep(500); + } catch (InterruptedException e) { + return null; + } + + if (lockAcquisitionTimeout != -1 && (System.currentTimeMillis() - start) > lockAcquisitionTimeout) { + throw new Exception("timed out waiting for lock"); + } +} else { + return lock; } - } else { -return lock; + } catch (IOException e) { +// IOException during trylock() may be a temporary issue, e.g. NFS volume not being accessible +logger.log(isRecurringFailure ? Logger.Level.DEBUG : Logger.Level.WARN, +"Failure when accessing a lock file", e); +isRecurringFailure = true; --- End diff -- This would continue to attempt to lock regardless the `lockAcquisitionTimeout` (if any): i suppose it would be more robust if you will check the timeout anyway. There are tests that are using that timeout and I suppose it should be honored ---
[GitHub] activemq-artemis pull request #2287: ARTEMIS-2069 Backup doesn't activate af...
Github user franz1981 commented on a diff in the pull request: https://github.com/apache/activemq-artemis/pull/2287#discussion_r231513243 --- Diff: artemis-server/src/main/java/org/apache/activemq/artemis/core/server/impl/FileLockNodeManager.java --- @@ -299,36 +301,57 @@ protected FileLock tryLock(final long lockPos) throws IOException { protected FileLock lock(final long lockPosition) throws Exception { long start = System.currentTimeMillis(); + boolean isRecurringFailure = false; while (!interrupted) { - FileLock lock = tryLock(lockPosition); - - if (lock == null) { -try { - Thread.sleep(500); -} catch (InterruptedException e) { - return null; -} - -if (lockAcquisitionTimeout != -1 && (System.currentTimeMillis() - start) > lockAcquisitionTimeout) { - throw new Exception("timed out waiting for lock"); + try { +FileLock lock = tryLock(lockPosition); +isRecurringFailure = false; + +if (lock == null) { + logger.debug("lock is null"); + try { + Thread.sleep(500); + } catch (InterruptedException e) { + return null; + } + + if (lockAcquisitionTimeout != -1 && (System.currentTimeMillis() - start) > lockAcquisitionTimeout) { + throw new Exception("timed out waiting for lock"); + } +} else { + return lock; } - } else { -return lock; + } catch (IOException e) { +// IOException during trylock() may be a temporary issue, e.g. NFS volume not being accessible +logger.log(isRecurringFailure ? Logger.Level.DEBUG : Logger.Level.WARN, +"Failure when accessing a lock file", e); +isRecurringFailure = true; +Thread.sleep(LOCK_ACCESS_FAILURE_WAIT_TIME); } } // todo this is here because sometimes channel.lock throws a resource deadlock exception but trylock works, // need to investigate further and review - FileLock lock; + FileLock lock = null; --- End diff -- Same thing as the comment above. ---
[GitHub] activemq-artemis pull request #2287: ARTEMIS-2069 Backup doesn't activate af...
GitHub user TomasHofman opened a pull request: https://github.com/apache/activemq-artemis/pull/2287 ARTEMIS-2069 Backup doesn't activate after shared store is reconnected https://issues.apache.org/jira/browse/ARTEMIS-2069 https://issues.jboss.org/browse/WFLY-10968 https://issues.jboss.org/browse/JBEAP-15343 Fix tries to prevent a server activation thread from terminating when FileLockNodeManager.tryLock() throws an IOException, e.g. because temporarily inaccessible NFS directory. The node manager will repeat tryLock() call every two seconds. WARN message with stack trace will be printed on first failure, DEBUG messages will be printed on recurring failures. You can merge this pull request into a Git repository by running: $ git pull https://github.com/TomasHofman/activemq-artemis JBEAP-14032-master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/activemq-artemis/pull/2287.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2287 commit 162e71816e54137534c0bc8b2c3d6c85f941917d Author: Tomas Hofman Date: 2018-09-03T13:47:03Z ARTEMIS-2069 Backup doesn't activate after shared store is reconnected ---