[jira] [Commented] (ARTEMIS-2069) Backup doesn't activate after shared store is reconnected

ASF GitHub Bot (JIRA) Thu, 08 Nov 2018 10:23:08 -0800


    [ https://issues.apache.org/jira/browse/ARTEMIS-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16679519#comment-16679519 ]


ASF GitHub Bot commented on ARTEMIS-2069:
-----------------------------------------

Github user TomasHofman commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/2287#discussion_r231822854
  
    --- Diff: artemis-server/src/main/java/org/apache/activemq/artemis/core/server/impl/FileLockNodeManager.java ---
    @@ -299,36 +301,57 @@ protected FileLock tryLock(final long lockPos) throws IOException {
     
        protected FileLock lock(final long lockPosition) throws Exception {
           long start = System.currentTimeMillis();
    +      boolean isRecurringFailure = false;
     
           while (!interrupted) {
    -         FileLock lock = tryLock(lockPosition);
    -
    -         if (lock == null) {
    -            try {
    -               Thread.sleep(500);
    -            } catch (InterruptedException e) {
    -               return null;
    -            }
    -
    -            if (lockAcquisitionTimeout != -1 && (System.currentTimeMillis() - start) > lockAcquisitionTimeout) {
    -               throw new Exception("timed out waiting for lock");
    +         try {
    +            FileLock lock = tryLock(lockPosition);
    +            isRecurringFailure = false;
    +
    +            if (lock == null) {
    +               logger.debug("lock is null");
    +               try {
    +                  Thread.sleep(500);
    +               } catch (InterruptedException e) {
    +                  return null;
    +               }
    +
    +               if (lockAcquisitionTimeout != -1 && (System.currentTimeMillis() - start) > lockAcquisitionTimeout) {
    +                  throw new Exception("timed out waiting for lock");
    +               }
    +            } else {
    +               return lock;
                 }
    -         } else {
    -            return lock;
    +         } catch (IOException e) {
    +            // IOException during trylock() may be a temporary issue, e.g. NFS volume not being accessible
    +            logger.log(isRecurringFailure ? Logger.Level.DEBUG : Logger.Level.WARN,
    +                    "Failure when accessing a lock file", e);
    +            isRecurringFailure = true;
    +            Thread.sleep(LOCK_ACCESS_FAILURE_WAIT_TIME);
              }
           }
     
           // todo this is here because sometimes channel.lock throws a resource deadlock exception but trylock works,
           // need to investigate further and review
    -      FileLock lock;
    +      FileLock lock = null;
    --- End diff --
    
    Looking at this again, the whole second loop in the _original code_
    doesn't make sense: the only way execution can reach it is when the
    `interrupted` flag has been set to true, in which case we should exit
    immediately. The comment mentions a "deadlock exception", but any
    exception thrown in the first loop would terminate the method.
    
    I'm going to remove this second loop altogether.
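
    To illustrate the control-flow argument above, here is a minimal, hypothetical sketch (not the code from the pull request). The class name `InterruptedExitSketch`, the `interrupt()` method and the `Supplier`-based lock attempt are assumptions made up for this example. Every path inside the loop either returns or goes around again, so the code after the loop can only run once `interrupted` is true.

```java
import java.util.function.Supplier;

public class InterruptedExitSketch {

    // Mirrors the role of the "interrupted" flag discussed above (assumed to be set by another thread).
    private volatile boolean interrupted;

    public void interrupt() {
        interrupted = true;
    }

    // tryLock is a hypothetical stand-in for a non-blocking lock attempt that returns null on failure.
    public <T> T lock(Supplier<T> tryLock) {
        while (!interrupted) {
            T lock = tryLock.get();
            if (lock != null) {
                return lock; // success: leave the method from inside the loop
            }
            Thread.yield(); // a real implementation would sleep between attempts
        }
        // Reached only when interrupted == true, so there is nothing left to retry.
        return null;
    }
}
```

    With this shape, a second retry loop after the first one would never do useful work, which is the point made in the comment.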


> Backup doesn't activate after shared store is reconnected
> ---------------------------------------------------------
>
>                 Key: ARTEMIS-2069
>                 URL: https://issues.apache.org/jira/browse/ARTEMIS-2069
>             Project: ActiveMQ Artemis
>          Issue Type: Bug
>    Affects Versions: 2.6.2
>            Reporter: Tomas Hofman
>            Priority: Major
>
> *Scenario*
>  # Start a live/backup server pair in a dedicated topology with shared store HA, 
> with the journal located on NFS
>  # The NFS mount on the backup server fails
>  # Reconnect NFS on the backup server
>  # Try to shut down the live EAP server
>  # The backup doesn't activate
> *What happens*
>  The backup waits for the live server to fail by checking its file lock. If the 
> connection to shared storage fails, the backup logs the following error.
>  
> 05:50:57,896 ERROR [org.apache.activemq.artemis.core.server] (AMQ119000: Activation for server ActiveMQServerImpl::serverUUID=836c9b1e-f067-11e7-8763-001b21862475) AMQ224000: Failure in initialisation: java.io.IOException: Input/output error
>  at sun.nio.ch.FileDispatcherImpl.lock0(Native Method) [rt.jar:1.8.0_151]
>  at sun.nio.ch.FileDispatcherImpl.lock(FileDispatcherImpl.java:90) [rt.jar:1.8.0_151]
>  at sun.nio.ch.FileChannelImpl.tryLock(FileChannelImpl.java:1115) [rt.jar:1.8.0_151]
>  at org.apache.activemq.artemis.core.server.impl.FileLockNodeManager.tryLock(FileLockNodeManager.java:299) [artemis-server-1.5.5.008-redhat-1.jar:1.5.5.008-redhat-1]
>  at org.apache.activemq.artemis.core.server.impl.FileLockNodeManager.lock(FileLockNodeManager.java:316) [artemis-server-1.5.5.008-redhat-1.jar:1.5.5.008-redhat-1]
>  at org.apache.activemq.artemis.core.server.impl.FileLockNodeManager.awaitLiveNode(FileLockNodeManager.java:127) [artemis-server-1.5.5.008-redhat-1.jar:1.5.5.008-redhat-1]
>  at org.apache.activemq.artemis.core.server.impl.SharedStoreBackupActivation.run(SharedStoreBackupActivation.java:77) [artemis-server-1.5.5.008-redhat-1.jar:1.5.5.008-redhat-1]
>  at org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$ActivationThread.run(ActiveMQServerImpl.java:2496) [artemis-server-1.5.5.008-redhat-1.jar:1.5.5.008-redhat-1]
>  
> The exception is caught in {{SharedStoreBackupActivation.run}} and terminates 
> the backup activation process.
> If the NFS is reconnected later, the backup server doesn't resume the 
> activation process and doesn't wait for the live server to fail. If the live 
> server then fails, the backup doesn't activate, even though it has a 
> connection to shared storage.
> The backup should keep retrying the live lock check even while the storage is 
> unavailable. It should log warning/error messages that the storage is 
> unavailable, but it should not terminate the activation process. This would 
> allow the backup to continue its duties once the storage is reconnected.
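
The retry behaviour requested above could be sketched roughly as follows. This is an illustrative outline only, not the actual Artemis implementation: `RetryingLockSketch`, `LockAttempt` and the parameter names are hypothetical, and plain `System.err` output stands in for a real logger. An `IOException` from a lock attempt is treated as a possibly temporary storage problem, reported once at warning level, and retried. Unlike the diff under review, this sketch applies the timeout check after every unsuccessful attempt, which is a simplifying assumption.

```java
import java.io.IOException;
import java.util.concurrent.TimeoutException;

public class RetryingLockSketch {

    /** Hypothetical stand-in for a non-blocking attempt such as FileChannel.tryLock(). */
    public interface LockAttempt<T> {
        T tryAcquire() throws IOException; // returns null when the lock is held elsewhere
    }

    /**
     * Retries until a lock is obtained. A timeout of -1 means "wait forever".
     * All names and parameters here are assumptions made for illustration.
     */
    public static <T> T acquire(LockAttempt<T> attempt, long retryDelayMillis, long timeoutMillis)
            throws InterruptedException, TimeoutException {
        long start = System.currentTimeMillis();
        boolean recurringFailure = false;

        while (true) {
            try {
                T lock = attempt.tryAcquire();
                recurringFailure = false; // storage is reachable again
                if (lock != null) {
                    return lock;
                }
            } catch (IOException e) {
                // Possibly temporary, e.g. a shared volume that is not accessible right now.
                // Report only the first failure of a run loudly; a real logger could
                // record the repeats at debug level instead of dropping them.
                if (!recurringFailure) {
                    System.err.println("WARN: failure when accessing a lock file: " + e.getMessage());
                }
                recurringFailure = true;
            }

            if (timeoutMillis != -1 && System.currentTimeMillis() - start > timeoutMillis) {
                throw new TimeoutException("timed out waiting for lock");
            }
            Thread.sleep(retryDelayMillis);
        }
    }
}
```

The key design point is that the storage failure never escapes the loop, so the caller keeps waiting for the lock and can proceed once the storage comes back.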



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
