[GitHub] activemq-artemis pull request #2287: ARTEMIS-2069 Backup doesn't activate af...

2018-12-10 Thread TomasHofman
Github user TomasHofman commented on a diff in the pull request:

https://github.com/apache/activemq-artemis/pull/2287#discussion_r240294762
  
--- Diff: 
artemis-server/src/main/java/org/apache/activemq/artemis/core/server/impl/FileLockNodeManager.java
 ---
@@ -49,6 +49,8 @@
 
private static final byte NOT_STARTED = 'N';
 
+   private static final long LOCK_ACCESS_FAILURE_WAIT_TIME = 2000;
--- End diff --

I added another test to verify that `lockAcquisitionTimeout` is honored.

The only hard waiting time is 500ms on line 313, but that's the original 
behavior. I can make that honor `lockAcquisitionTimeout` as well if desired.


---


[GitHub] activemq-artemis pull request #2287: ARTEMIS-2069 Backup doesn't activate af...

2018-12-10 Thread TomasHofman
Github user TomasHofman commented on a diff in the pull request:

https://github.com/apache/activemq-artemis/pull/2287#discussion_r240293147
  
--- Diff: 
artemis-server/src/main/java/org/apache/activemq/artemis/core/server/impl/FileLockNodeManager.java
 ---
@@ -49,6 +49,8 @@
 
private static final byte NOT_STARTED = 'N';
 
+   private static final long LOCK_ACCESS_FAILURE_WAIT_TIME = 2000;
--- End diff --

@clebertsuconic no, the `lockAcquisitionTimeout` is not ignored.

If you check the line 337, actual waiting time is either 
`LOCK_ACCESS_FAILURE_WAIT_TIME` or time remaining until 
`lockAcquisitionTimeout` runs out, whichever is lower, so acquisition timeout 
set by user is honored.


---


[GitHub] activemq-artemis pull request #2287: ARTEMIS-2069 Backup doesn't activate af...

2018-12-10 Thread clebertsuconic
Github user clebertsuconic commented on a diff in the pull request:

https://github.com/apache/activemq-artemis/pull/2287#discussion_r240232734
  
--- Diff: 
artemis-server/src/main/java/org/apache/activemq/artemis/core/server/impl/FileLockNodeManager.java
 ---
@@ -49,6 +49,8 @@
 
private static final byte NOT_STARTED = 'N';
 
+   private static final long LOCK_ACCESS_FAILURE_WAIT_TIME = 2000;
--- End diff --

You are here effectively ignoring the acquire timeout where the user can 
configure it. I'm not sure this is correct.

your test should play with a configured timeout and see if you get the 
expected behaviour. Adding this you are forcing your own timeout bypassing the 
configured one.


---


[GitHub] activemq-artemis pull request #2287: ARTEMIS-2069 Backup doesn't activate af...

2018-11-08 Thread TomasHofman
Github user TomasHofman commented on a diff in the pull request:

https://github.com/apache/activemq-artemis/pull/2287#discussion_r231878791
  
--- Diff: 
artemis-server/src/main/java/org/apache/activemq/artemis/core/server/impl/FileLockNodeManager.java
 ---
@@ -299,44 +303,52 @@ protected FileLock tryLock(final long lockPos) throws 
IOException {
 
protected FileLock lock(final long lockPosition) throws Exception {
   long start = System.currentTimeMillis();
+  boolean isRecurringFailure = false;
 
   while (!interrupted) {
- FileLock lock = tryLock(lockPosition);
-
- if (lock == null) {
-try {
-   Thread.sleep(500);
-} catch (InterruptedException e) {
-   return null;
+ try {
+FileLock lock = tryLock(lockPosition);
+isRecurringFailure = false;
+
+if (lock == null) {
+   try {
+  Thread.sleep(500);
+   } catch (InterruptedException e) {
+  return null;
+   }
+
+   if (lockAcquisitionTimeout != -1 && 
(System.currentTimeMillis() - start) > lockAcquisitionTimeout) {
+  throw new Exception("timed out waiting for lock");
+   }
+} else {
+   return lock;
 }
-
-if (lockAcquisitionTimeout != -1 && 
(System.currentTimeMillis() - start) > lockAcquisitionTimeout) {
-   throw new Exception("timed out waiting for lock");
+ } catch (IOException e) {
+// IOException during trylock() may be a temporary issue, e.g. 
NFS volume not being accessible
+
+logger.log(isRecurringFailure ? Logger.Level.DEBUG : 
Logger.Level.WARN,
+"Failure when accessing a lock file", e);
+isRecurringFailure = true;
+
+long waitTime = LOCK_ACCESS_FAILURE_WAIT_TIME;
+if (lockAcquisitionTimeout != -1) {
+   final long remainingTime = lockAcquisitionTimeout - 
(System.currentTimeMillis() - start);
+   if (remainingTime <= 0) {
+  throw new Exception("timed out waiting for lock");
+   }
+   waitTime = Collections.min(Arrays.asList(waitTime, 
remainingTime));
--- End diff --

Good point, updated.


---


[GitHub] activemq-artemis pull request #2287: ARTEMIS-2069 Backup doesn't activate af...

2018-11-08 Thread franz1981
Github user franz1981 commented on a diff in the pull request:

https://github.com/apache/activemq-artemis/pull/2287#discussion_r231870842
  
--- Diff: 
artemis-server/src/main/java/org/apache/activemq/artemis/core/server/impl/FileLockNodeManager.java
 ---
@@ -299,44 +303,52 @@ protected FileLock tryLock(final long lockPos) throws 
IOException {
 
protected FileLock lock(final long lockPosition) throws Exception {
   long start = System.currentTimeMillis();
+  boolean isRecurringFailure = false;
 
   while (!interrupted) {
- FileLock lock = tryLock(lockPosition);
-
- if (lock == null) {
-try {
-   Thread.sleep(500);
-} catch (InterruptedException e) {
-   return null;
+ try {
+FileLock lock = tryLock(lockPosition);
+isRecurringFailure = false;
+
+if (lock == null) {
+   try {
+  Thread.sleep(500);
+   } catch (InterruptedException e) {
+  return null;
+   }
+
+   if (lockAcquisitionTimeout != -1 && 
(System.currentTimeMillis() - start) > lockAcquisitionTimeout) {
+  throw new Exception("timed out waiting for lock");
+   }
+} else {
+   return lock;
 }
-
-if (lockAcquisitionTimeout != -1 && 
(System.currentTimeMillis() - start) > lockAcquisitionTimeout) {
-   throw new Exception("timed out waiting for lock");
+ } catch (IOException e) {
+// IOException during trylock() may be a temporary issue, e.g. 
NFS volume not being accessible
+
+logger.log(isRecurringFailure ? Logger.Level.DEBUG : 
Logger.Level.WARN,
+"Failure when accessing a lock file", e);
+isRecurringFailure = true;
+
+long waitTime = LOCK_ACCESS_FAILURE_WAIT_TIME;
+if (lockAcquisitionTimeout != -1) {
+   final long remainingTime = lockAcquisitionTimeout - 
(System.currentTimeMillis() - start);
+   if (remainingTime <= 0) {
+  throw new Exception("timed out waiting for lock");
+   }
+   waitTime = Collections.min(Arrays.asList(waitTime, 
remainingTime));
--- End diff --

`Math.min` will do the same as `Collections.min(Arrays.asList(waitTime, 
remainingTime));`


---


[GitHub] activemq-artemis pull request #2287: ARTEMIS-2069 Backup doesn't activate af...

2018-11-08 Thread TomasHofman
Github user TomasHofman commented on a diff in the pull request:

https://github.com/apache/activemq-artemis/pull/2287#discussion_r231830652
  
--- Diff: 
artemis-server/src/main/java/org/apache/activemq/artemis/core/server/impl/FileLockNodeManager.java
 ---
@@ -299,36 +301,57 @@ protected FileLock tryLock(final long lockPos) throws 
IOException {
 
protected FileLock lock(final long lockPosition) throws Exception {
   long start = System.currentTimeMillis();
+  boolean isRecurringFailure = false;
 
   while (!interrupted) {
- FileLock lock = tryLock(lockPosition);
-
- if (lock == null) {
-try {
-   Thread.sleep(500);
-} catch (InterruptedException e) {
-   return null;
-}
-
-if (lockAcquisitionTimeout != -1 && 
(System.currentTimeMillis() - start) > lockAcquisitionTimeout) {
-   throw new Exception("timed out waiting for lock");
+ try {
+FileLock lock = tryLock(lockPosition);
+isRecurringFailure = false;
+
+if (lock == null) {
+   logger.debug("lock is null");
+   try {
+  Thread.sleep(500);
+   } catch (InterruptedException e) {
+  return null;
+   }
+
+   if (lockAcquisitionTimeout != -1 && 
(System.currentTimeMillis() - start) > lockAcquisitionTimeout) {
+  throw new Exception("timed out waiting for lock");
+   }
+} else {
+   return lock;
 }
- } else {
-return lock;
+ } catch (IOException e) {
+// IOException during trylock() may be a temporary issue, e.g. 
NFS volume not being accessible
+logger.log(isRecurringFailure ? Logger.Level.DEBUG : 
Logger.Level.WARN,
+"Failure when accessing a lock file", e);
+isRecurringFailure = true;
--- End diff --

Modified to exit if timeout already reached, and don't sleep longer then 
remaining time to timeout.


---


[GitHub] activemq-artemis pull request #2287: ARTEMIS-2069 Backup doesn't activate af...

2018-11-08 Thread TomasHofman
Github user TomasHofman commented on a diff in the pull request:

https://github.com/apache/activemq-artemis/pull/2287#discussion_r231822854
  
--- Diff: 
artemis-server/src/main/java/org/apache/activemq/artemis/core/server/impl/FileLockNodeManager.java
 ---
@@ -299,36 +301,57 @@ protected FileLock tryLock(final long lockPos) throws 
IOException {
 
protected FileLock lock(final long lockPosition) throws Exception {
   long start = System.currentTimeMillis();
+  boolean isRecurringFailure = false;
 
   while (!interrupted) {
- FileLock lock = tryLock(lockPosition);
-
- if (lock == null) {
-try {
-   Thread.sleep(500);
-} catch (InterruptedException e) {
-   return null;
-}
-
-if (lockAcquisitionTimeout != -1 && 
(System.currentTimeMillis() - start) > lockAcquisitionTimeout) {
-   throw new Exception("timed out waiting for lock");
+ try {
+FileLock lock = tryLock(lockPosition);
+isRecurringFailure = false;
+
+if (lock == null) {
+   logger.debug("lock is null");
+   try {
+  Thread.sleep(500);
+   } catch (InterruptedException e) {
+  return null;
+   }
+
+   if (lockAcquisitionTimeout != -1 && 
(System.currentTimeMillis() - start) > lockAcquisitionTimeout) {
+  throw new Exception("timed out waiting for lock");
+   }
+} else {
+   return lock;
 }
- } else {
-return lock;
+ } catch (IOException e) {
+// IOException during trylock() may be a temporary issue, e.g. 
NFS volume not being accessible
+logger.log(isRecurringFailure ? Logger.Level.DEBUG : 
Logger.Level.WARN,
+"Failure when accessing a lock file", e);
+isRecurringFailure = true;
+Thread.sleep(LOCK_ACCESS_FAILURE_WAIT_TIME);
  }
   }
 
   // todo this is here because sometimes channel.lock throws a 
resource deadlock exception but trylock works,
   // need to investigate further and review
-  FileLock lock;
+  FileLock lock = null;
--- End diff --

Now when I look at this again, this whole second loop in the _original 
code_ doesn't make sense - the only way execution could get here is when the 
```interrupted``` flag was set to true, in which case we should exit 
immediately. The comment mentions "deadlock exception", but any exception in 
the first loop would terminate the method.

I'm gonna remove this second loop altogether.


---


[GitHub] activemq-artemis pull request #2287: ARTEMIS-2069 Backup doesn't activate af...

2018-11-07 Thread franz1981
Github user franz1981 commented on a diff in the pull request:

https://github.com/apache/activemq-artemis/pull/2287#discussion_r231511486
  
--- Diff: 
artemis-server/src/main/java/org/apache/activemq/artemis/core/server/impl/FileLockNodeManager.java
 ---
@@ -299,36 +301,57 @@ protected FileLock tryLock(final long lockPos) throws 
IOException {
 
protected FileLock lock(final long lockPosition) throws Exception {
   long start = System.currentTimeMillis();
+  boolean isRecurringFailure = false;
 
   while (!interrupted) {
- FileLock lock = tryLock(lockPosition);
-
- if (lock == null) {
-try {
-   Thread.sleep(500);
-} catch (InterruptedException e) {
-   return null;
-}
-
-if (lockAcquisitionTimeout != -1 && 
(System.currentTimeMillis() - start) > lockAcquisitionTimeout) {
-   throw new Exception("timed out waiting for lock");
+ try {
+FileLock lock = tryLock(lockPosition);
+isRecurringFailure = false;
+
+if (lock == null) {
+   logger.debug("lock is null");
+   try {
+  Thread.sleep(500);
+   } catch (InterruptedException e) {
+  return null;
+   }
+
+   if (lockAcquisitionTimeout != -1 && 
(System.currentTimeMillis() - start) > lockAcquisitionTimeout) {
+  throw new Exception("timed out waiting for lock");
+   }
+} else {
+   return lock;
 }
- } else {
-return lock;
+ } catch (IOException e) {
+// IOException during trylock() may be a temporary issue, e.g. 
NFS volume not being accessible
+logger.log(isRecurringFailure ? Logger.Level.DEBUG : 
Logger.Level.WARN,
+"Failure when accessing a lock file", e);
+isRecurringFailure = true;
--- End diff --

This would continue to attempt to lock regardless the 
`lockAcquisitionTimeout` (if any): i suppose it would be more robust if you 
will check the timeout anyway. There are tests that are using that timeout and 
I suppose it should be honored


---


[GitHub] activemq-artemis pull request #2287: ARTEMIS-2069 Backup doesn't activate af...

2018-11-07 Thread franz1981
Github user franz1981 commented on a diff in the pull request:

https://github.com/apache/activemq-artemis/pull/2287#discussion_r231513243
  
--- Diff: 
artemis-server/src/main/java/org/apache/activemq/artemis/core/server/impl/FileLockNodeManager.java
 ---
@@ -299,36 +301,57 @@ protected FileLock tryLock(final long lockPos) throws 
IOException {
 
protected FileLock lock(final long lockPosition) throws Exception {
   long start = System.currentTimeMillis();
+  boolean isRecurringFailure = false;
 
   while (!interrupted) {
- FileLock lock = tryLock(lockPosition);
-
- if (lock == null) {
-try {
-   Thread.sleep(500);
-} catch (InterruptedException e) {
-   return null;
-}
-
-if (lockAcquisitionTimeout != -1 && 
(System.currentTimeMillis() - start) > lockAcquisitionTimeout) {
-   throw new Exception("timed out waiting for lock");
+ try {
+FileLock lock = tryLock(lockPosition);
+isRecurringFailure = false;
+
+if (lock == null) {
+   logger.debug("lock is null");
+   try {
+  Thread.sleep(500);
+   } catch (InterruptedException e) {
+  return null;
+   }
+
+   if (lockAcquisitionTimeout != -1 && 
(System.currentTimeMillis() - start) > lockAcquisitionTimeout) {
+  throw new Exception("timed out waiting for lock");
+   }
+} else {
+   return lock;
 }
- } else {
-return lock;
+ } catch (IOException e) {
+// IOException during trylock() may be a temporary issue, e.g. 
NFS volume not being accessible
+logger.log(isRecurringFailure ? Logger.Level.DEBUG : 
Logger.Level.WARN,
+"Failure when accessing a lock file", e);
+isRecurringFailure = true;
+Thread.sleep(LOCK_ACCESS_FAILURE_WAIT_TIME);
  }
   }
 
   // todo this is here because sometimes channel.lock throws a 
resource deadlock exception but trylock works,
   // need to investigate further and review
-  FileLock lock;
+  FileLock lock = null;
--- End diff --

Same thing as the comment above.


---


[GitHub] activemq-artemis pull request #2287: ARTEMIS-2069 Backup doesn't activate af...

2018-09-03 Thread TomasHofman
GitHub user TomasHofman opened a pull request:

https://github.com/apache/activemq-artemis/pull/2287

ARTEMIS-2069 Backup doesn't activate after shared store is reconnected

https://issues.apache.org/jira/browse/ARTEMIS-2069
https://issues.jboss.org/browse/WFLY-10968
https://issues.jboss.org/browse/JBEAP-15343

Fix tries to prevent a server activation thread from terminating when 
FileLockNodeManager.tryLock() throws an IOException, e.g. because temporarily 
inaccessible NFS directory.
The node manager will repeat tryLock() call every two seconds.
WARN message with stack trace will be printed on first failure, DEBUG 
messages will be printed on recurring failures.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/TomasHofman/activemq-artemis 
JBEAP-14032-master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/activemq-artemis/pull/2287.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2287


commit 162e71816e54137534c0bc8b2c3d6c85f941917d
Author: Tomas Hofman 
Date:   2018-09-03T13:47:03Z

ARTEMIS-2069 Backup doesn't activate after shared store is reconnected




---