[jira] [Commented] (ARTEMIS-2069) Backup doesn't activate after shared store is reconnected

ASF GitHub Bot (JIRA) Wed, 07 Nov 2018 06:04:27 -0800


    [ https://issues.apache.org/jira/browse/ARTEMIS-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16678252#comment-16678252 ]


ASF GitHub Bot commented on ARTEMIS-2069:
-----------------------------------------

Github user franz1981 commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/2287#discussion_r231513243
  
    --- Diff: artemis-server/src/main/java/org/apache/activemq/artemis/core/server/impl/FileLockNodeManager.java ---
    @@ -299,36 +301,57 @@ protected FileLock tryLock(final long lockPos) throws IOException {
     
        protected FileLock lock(final long lockPosition) throws Exception {
           long start = System.currentTimeMillis();
    +      boolean isRecurringFailure = false;
     
           while (!interrupted) {
    -         FileLock lock = tryLock(lockPosition);
    -
    -         if (lock == null) {
    -            try {
    -               Thread.sleep(500);
    -            } catch (InterruptedException e) {
    -               return null;
    -            }
    -
    -            if (lockAcquisitionTimeout != -1 && (System.currentTimeMillis() - start) > lockAcquisitionTimeout) {
    -               throw new Exception("timed out waiting for lock");
    +         try {
    +            FileLock lock = tryLock(lockPosition);
    +            isRecurringFailure = false;
    +
    +            if (lock == null) {
    +               logger.debug("lock is null");
    +               try {
    +                  Thread.sleep(500);
    +               } catch (InterruptedException e) {
    +                  return null;
    +               }
    +
    +               if (lockAcquisitionTimeout != -1 && (System.currentTimeMillis() - start) > lockAcquisitionTimeout) {
    +                  throw new Exception("timed out waiting for lock");
    +               }
    +            } else {
    +               return lock;
                 }
    -         } else {
    -            return lock;
    +         } catch (IOException e) {
    +            // IOException during trylock() may be a temporary issue, e.g. NFS volume not being accessible
    +            logger.log(isRecurringFailure ? Logger.Level.DEBUG : Logger.Level.WARN,
    +                    "Failure when accessing a lock file", e);
    +            isRecurringFailure = true;
    +            Thread.sleep(LOCK_ACCESS_FAILURE_WAIT_TIME);
              }
           }
     
           // todo this is here because sometimes channel.lock throws a resource deadlock exception but trylock works,
           // need to investigate further and review
    -      FileLock lock;
    +      FileLock lock = null;
    --- End diff --
    
    Same thing as the comment above.


> Backup doesn't activate after shared store is reconnected
> ---------------------------------------------------------
>
>                 Key: ARTEMIS-2069
>                 URL: https://issues.apache.org/jira/browse/ARTEMIS-2069
>             Project: ActiveMQ Artemis
>          Issue Type: Bug
>    Affects Versions: 2.6.2
>            Reporter: Tomas Hofman
>            Priority: Major
>
> *Scenario*
>  # Start a live/backup server pair in a dedicated topology with shared store HA,
> with the journal located on NFS
>  # The NFS mount on the backup server fails
>  # Reconnect NFS on the backup server
>  # Try to shut down the live EAP server
>  # The backup doesn't activate
> *What happens*
>  The backup waits for the live server to fail by checking its file lock. If the
> connection to shared storage fails, the backup logs the following error.
>  
> 05:50:57,896 ERROR [org.apache.activemq.artemis.core.server] (AMQ119000: Activation for server ActiveMQServerImpl::serverUUID=836c9b1e-f067-11e7-8763-001b21862475) AMQ224000: Failure in initialisation: java.io.IOException: Input/output error
>  at sun.nio.ch.FileDispatcherImpl.lock0(Native Method) [rt.jar:1.8.0_151]
>  at sun.nio.ch.FileDispatcherImpl.lock(FileDispatcherImpl.java:90) [rt.jar:1.8.0_151]
>  at sun.nio.ch.FileChannelImpl.tryLock(FileChannelImpl.java:1115) [rt.jar:1.8.0_151]
>  at org.apache.activemq.artemis.core.server.impl.FileLockNodeManager.tryLock(FileLockNodeManager.java:299) [artemis-server-1.5.5.008-redhat-1.jar:1.5.5.008-redhat-1]
>  at org.apache.activemq.artemis.core.server.impl.FileLockNodeManager.lock(FileLockNodeManager.java:316) [artemis-server-1.5.5.008-redhat-1.jar:1.5.5.008-redhat-1]
>  at org.apache.activemq.artemis.core.server.impl.FileLockNodeManager.awaitLiveNode(FileLockNodeManager.java:127) [artemis-server-1.5.5.008-redhat-1.jar:1.5.5.008-redhat-1]
>  at org.apache.activemq.artemis.core.server.impl.SharedStoreBackupActivation.run(SharedStoreBackupActivation.java:77) [artemis-server-1.5.5.008-redhat-1.jar:1.5.5.008-redhat-1]
>  at org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$ActivationThread.run(ActiveMQServerImpl.java:2496) [artemis-server-1.5.5.008-redhat-1.jar:1.5.5.008-redhat-1]
>  
> The exception is caught in {{SharedStoreBackupActivation.run}} and terminates
> the backup activation process.
> If the NFS is reconnected later, the backup server doesn't resume the
> activation process and doesn't wait for the live server to fail. If the live
> server then fails, the backup doesn't activate, even though it has a
> connection to shared storage.
> The backup should keep retrying the live lock check even while the storage is
> unavailable. It should log warning/error messages that the storage is
> unavailable, but it should not terminate the activation process. This would
> allow the backup to continue its duties when the storage is reconnected.
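
The retry behaviour requested above can be sketched in isolation. The following is only an illustration, not the code from the pull request: RetryingLockSketch, LockAttempt, acquire and the three wait parameters are hypothetical names, and the lock attempt is simulated rather than backed by a real FileChannel. It assumes, like the diff, that a null result means the lock is held elsewhere and that an IOException may be a transient storage failure, logged loudly only the first time it occurs in a row.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;

final class RetryingLockSketch {

   /** Hypothetical stand-in for tryLock(): returns null while another holder has the lock. */
   interface LockAttempt<T> {
      T tryOnce() throws IOException;
   }

   /**
    * Polls until the attempt yields a lock. An IOException is treated as a
    * possibly temporary storage problem: it is reported once, then retried quietly.
    * A timeout of -1 means wait forever. Returns null if the thread is interrupted
    * before an attempt; an interrupt during a sleep propagates as InterruptedException.
    */
   static <T> T acquire(LockAttempt<T> attempt, long timeoutMillis,
                        long pollMillis, long failureWaitMillis) throws Exception {
      long start = System.currentTimeMillis();
      boolean recurringFailure = false;

      while (!Thread.currentThread().isInterrupted()) {
         try {
            T lock = attempt.tryOnce();
            recurringFailure = false; // storage is reachable again
            if (lock != null) {
               return lock;
            }
            Thread.sleep(pollMillis);
            if (timeoutMillis != -1 && System.currentTimeMillis() - start > timeoutMillis) {
               throw new Exception("timed out waiting for lock");
            }
         } catch (IOException e) {
            // Only the first failure in a row is reported, to avoid flooding the log.
            if (!recurringFailure) {
               System.err.println("WARN: failure when accessing a lock file: " + e.getMessage());
            }
            recurringFailure = true;
            Thread.sleep(failureWaitMillis);
         }
      }
      return null;
   }

   public static void main(String[] args) throws Exception {
      // Simulated storage outage: the first two attempts fail, the third succeeds.
      AtomicInteger calls = new AtomicInteger();
      String lock = RetryingLockSketch.<String>acquire(() -> {
         if (calls.incrementAndGet() <= 2) {
            throw new IOException("Input/output error");
         }
         return "lock";
      }, -1, 10, 10);
      System.out.println("acquired " + lock + " after " + calls.get() + " attempts");
   }
}
```

The timeout here only applies while the lock is held elsewhere, not while the storage is failing, which matches the structure of the diff; whether an unreachable store should also be bounded by a timeout is a separate design decision.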



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
