[
https://issues.apache.org/jira/browse/ARTEMIS-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16441434#comment-16441434
]
Catalin Alexandru Zamfir edited comment on ARTEMIS-1285 at 4/17/18 8:21 PM:
----------------------------------------------------------------------------
The example is simple. Our set-up uses JGroups TCPPING discovery with
"initial_hosts" set to the live nodes only. To avoid the JGroups "no
physical address for node: UUID" warning, we have set "send_cache_on_join"
and "return_entire_cache" to true on the TCPPING stanza in the JGroups
configuration.
Below are the master/slave configurations (identical across the backups).
{code:xml}
<!-- ... Jgroups TCPPING discovery/broadcast configuration above ... -->
<!-- Working cluster, tested with ./artemis producer/consumer CLI commands
     from different nodes on different physical machines. -->

<!-- on master (live) -->
<ha-policy>
   <replication>
      <master>
         <check-for-live-server>true</check-for-live-server>
         <group-name>g1</group-name>
         <initial-replication-sync-timeout>15000</initial-replication-sync-timeout>
         <cluster-name>shared-artemis-cluster</cluster-name>
         <vote-on-replication-failure>true</vote-on-replication-failure>
      </master>
   </replication>
</ha-policy>

<!-- on replicas (2x) -->
<ha-policy>
   <replication>
      <slave>
         <allow-failback>true</allow-failback>
         <group-name>g1</group-name>
         <initial-replication-sync-timeout>15000</initial-replication-sync-timeout>
         <cluster-name>shared-artemis-cluster</cluster-name>
         <vote-retries>12</vote-retries>
         <vote-retry-wait>5000</vote-retry-wait>
      </slave>
   </replication>
</ha-policy>{code}
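The elided discovery/broadcast section referenced above would, in a typical
JGroups-backed Artemis broker.xml, look roughly like the sketch below. The
file, channel, group, and connector names are hypothetical (only the
shared-artemis-cluster name comes from our config); the point is that the
ha-policy's cluster-name must match a cluster-connection that discovers
members via the JGroups channel:
{code:xml}
<!-- Sketch only: hypothetical names, not our actual configuration. -->
<broadcast-groups>
   <broadcast-group name="bg-group1">
      <jgroups-file>jgroups-tcpping.xml</jgroups-file>
      <jgroups-channel>artemis_channel</jgroups-channel>
      <broadcast-period>5000</broadcast-period>
      <connector-ref>netty-connector</connector-ref>
   </broadcast-group>
</broadcast-groups>

<discovery-groups>
   <discovery-group name="dg-group1">
      <jgroups-file>jgroups-tcpping.xml</jgroups-file>
      <jgroups-channel>artemis_channel</jgroups-channel>
      <refresh-timeout>10000</refresh-timeout>
   </discovery-group>
</discovery-groups>

<cluster-connections>
   <!-- Name matches <cluster-name> in the ha-policy blocks above. -->
   <cluster-connection name="shared-artemis-cluster">
      <connector-ref>netty-connector</connector-ref>
      <discovery-group-ref discovery-group-name="dg-group1"/>
   </cluster-connection>
</cluster-connections>
{code}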
Note that I repeated the fresh install several times (using Ansible, Docker
and fresh LVs on LVM; everything is purged and reinstalled). On every fresh
install, "r3" enters the loop. But any manual intervention (e.g. a restart of
r3) makes it behave normally (staying in standby until the master is manually
stopped, at which point it becomes the backup for r2).
This looks like some sort of cluster "initial state" conflict, maybe related
to TCPPING + JGroups in this setup. Sadly I can't use multicast (UDP) on our
network to provide a different behaviour for comparison. I'm watching all 3
hawtio management consoles while installing the fresh cluster: one reports
live, another backup, and the 3rd throws "Broker is stopped" exceptions when
trying to view any attributes.
It's late for me; I'll take this for a spin tomorrow. If I can provide any
more information, please ask. Thanks!
> Standby slave would not announce replication to master when the slave is down
> -----------------------------------------------------------------------------
>
> Key: ARTEMIS-1285
> URL: https://issues.apache.org/jira/browse/ARTEMIS-1285
> Project: ActiveMQ Artemis
> Issue Type: Bug
> Components: Broker
> Affects Versions: 2.1.0
> Reporter: yangwei
> Priority: Major
>
> We have a cluster of 3 instances: A is master, B is slave and C is standby
> slave. When the slave is down, we expect C to announce replication to A,
> but A stays in standalone mode the whole time. We see C waiting at
> "nodeLocator.locateNode()" in a jstack thread dump.