[jira] [Commented] (ARTEMIS-2421) Implement periodic journal lock evaluation

ASF subversion and git services (Jira) Tue, 26 Nov 2019 18:06:34 -0800


    [ 
https://issues.apache.org/jira/browse/ARTEMIS-2421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16983090#comment-16983090
 ]


ASF subversion and git services commented on ARTEMIS-2421:
----------------------------------------------------------

Commit e12f3ddc6fe6a36013764008b1c5288c52cd6fda in activemq-artemis's branch 
refs/heads/master from Bas Elzinga
[ https://gitbox.apache.org/repos/asf?p=activemq-artemis.git;h=e12f3dd ]

ARTEMIS-2421 periodic journal lock evaluation

If a broker loses its file lock on the journal and doesn't notice (e.g.
network connection failure to an NFS mount) then it can continue to run
after its backup activates resulting in split-brain.

This commit implements periodic journal lock evaluation so that if a live
server loses its lock it will automatically restart itself.
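
For readers who want a feel for the idea, here is a minimal sketch of
periodic lock evaluation. It is not the code from the commit above: the
class name JournalLockMonitor, the lockStillHeld probe, and the onLockLost
callback are all hypothetical names chosen for illustration. The sketch
assumes the caller can supply some check that reports whether the journal
lock is still held, and some action (such as a restart) to run once when
it is not.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

// Hypothetical illustration only; not the class added by ARTEMIS-2421.
final class JournalLockMonitor implements AutoCloseable {
    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "journal-lock-monitor");
            t.setDaemon(true); // do not keep the JVM alive just for monitoring
            return t;
        });
    private final BooleanSupplier lockStillHeld; // assumed probe supplied by the caller
    private final Runnable onLockLost;           // assumed reaction, e.g. restart the server
    private final AtomicBoolean fired = new AtomicBoolean(false);

    JournalLockMonitor(BooleanSupplier lockStillHeld, Runnable onLockLost) {
        this.lockStillHeld = lockStillHeld;
        this.onLockLost = onLockLost;
    }

    // Schedule an evaluation every periodMillis milliseconds.
    void start(long periodMillis) {
        scheduler.scheduleWithFixedDelay(this::evaluate, periodMillis, periodMillis,
                                         TimeUnit.MILLISECONDS);
    }

    // Run one evaluation. Returns true if the lock is still held.
    // A probe that throws is treated as "lock lost".
    boolean evaluate() {
        boolean held;
        try {
            held = lockStillHeld.getAsBoolean();
        } catch (RuntimeException e) {
            held = false;
        }
        // Fire the callback at most once, then stop further periodic checks.
        if (!held && fired.compareAndSet(false, true)) {
            scheduler.shutdown();
            onLockLost.run();
        }
        return held;
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicBoolean held = new AtomicBoolean(true); // simulated lock state
        try (JournalLockMonitor monitor = new JournalLockMonitor(
                held::get,
                () -> System.out.println("journal lock lost; restarting (simulated)"))) {
            monitor.start(100);
            Thread.sleep(250);
            held.set(false); // simulate losing the lock
            Thread.sleep(250);
        }
    }
}
```

With a java.nio.channels.FileLock, the probe could be as simple as
() -> lock.isValid(), but note that isValid() only reflects the JVM's local
view (the lock has not been released and its channel is still open), so on
a network file system a real check would likely need to do more than that.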


> Implement periodic journal lock evaluation
> ------------------------------------------
>
>                 Key: ARTEMIS-2421
>                 URL: https://issues.apache.org/jira/browse/ARTEMIS-2421
>             Project: ActiveMQ Artemis
>          Issue Type: Bug
>          Components: Broker
>    Affects Versions: 2.6.4
>            Reporter: Gaurav
>            Assignee: Justin Bertram
>            Priority: Critical
>         Attachments: broker_master.xml, broker_slave.xml
>
>          Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> We have a live-backup configuration: a single Artemis live server 
> (version 2.6.4) backed up by a single backup server, using a shared file 
> system as persistent storage.
> Please refer to the attachments for the live and backup broker 
> configurations.
> *Failover Scenario*
>  # Node 1 acts as the live node and serves requests, whereas Node 2 acts as 
> the standby (passive) node. No consumer is connected to either node.
>  # Push 5 messages and verify that the message count is 5.
>  # Simulate a NIC (network) failure on Node 1 (i.e. the cluster can no 
> longer connect to Node 1). This makes Node 2 active, and we can also see 
> the previous 5 messages (pushed in step 2) successfully replicated on 
> Node 2.
>  # Bring the network connection back for Node 1. This is where we face the 
> issue: both nodes now act as live nodes and continuously log the error 
> below:
> {quote}{{{color:#FF0000}AMQ212034: There are more than one servers on the 
> network broadcasting the same node id. You will see this message exactly once 
> (per node) if a node is restarted, in which case it can be safely ignored. 
> But if it is logged continuously it means you really do have more than one 
> node on the same network active concurrently with the same node id. This 
> could occur  if you have a backup node active at the same time as its live 
> node. nodeID=cd323206-4adc-11e9-814b-506b8d4ee653{color}}}
>  
> {quote}
> This situation leaves the entire cluster in an inconsistent state, and we 
> are able to push messages to both nodes.
> Any pointers on this issue would be much appreciated!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
