[ 
https://issues.apache.org/jira/browse/ARTEMIS-3606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Austermühle updated ARTEMIS-3606:
-----------------------------------------
    Description: 
We have deployed ActiveMQ Artemis v2.19.0 in an HA + cluster configuration, hosted 
on Kubernetes (non-cloud), and use the JGroups {{KUBE_PING}} protocol for broker 
discovery. During regular operations we have two primary and two replica brokers, 
and everything looks fine.
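For context, the discovery wiring looks roughly like the following sketch (the channel, file, and connector names are assumptions for illustration, not our exact broker.xml):

```xml
<!-- broker.xml fragment: a broadcast group announcing this broker over a
     JGroups channel whose stack (defined in jgroups-ping.xml) uses KUBE_PING
     to discover Pods. All names here are assumed for illustration. -->
<broadcast-groups>
   <broadcast-group name="bg-group1">
      <jgroups-file>jgroups-ping.xml</jgroups-file>
      <jgroups-channel>activemq_broadcast_channel</jgroups-channel>
      <broadcast-period>2000</broadcast-period>
      <connector-ref>netty-connector</connector-ref>
   </broadcast-group>
</broadcast-groups>
```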

For testing, we removed the replica instances (no Pods left) and ended up in an 
inconsistent cluster state: two primaries, plus one zombie replica still listed as 
connected to primary 1. The replica instances were shut down gracefully by scaling 
the corresponding StatefulSet to zero, i.e., there was no hard kill.

Restarting the replicas brings the cluster back to a normal state – sometimes.

[According to the 
docs|https://activemq.apache.org/components/artemis/documentation/latest/clusters.html#discovery-groups],
 the missing broker instances should be removed:
{quote}If it has not received a broadcast from a particular server for a length 
of time it will remove that server's entry from its list.
{quote}
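The relevant knob for this behavior appears to be {{refresh-timeout}} on the discovery group. A minimal sketch (names and the timeout value are assumed, not our exact configuration):

```xml
<!-- broker.xml fragment: a discovery group backed by the same JGroups channel.
     refresh-timeout is how long the broker waits without receiving a broadcast
     before removing a server's entry from its list (names are assumed). -->
<discovery-groups>
   <discovery-group name="dg-group1">
      <jgroups-file>jgroups-ping.xml</jgroups-file>
      <jgroups-channel>activemq_broadcast_channel</jgroups-channel>
      <refresh-timeout>10000</refresh-timeout>
   </discovery-group>
</discovery-groups>
```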
The broker also includes the zombie replicas in its topology updates to JMS 
clients, resulting in more than 30 connection attempts per second in our case. 
Since Kubernetes no longer knows about the shut-down replica broker instances, 
the client’s name resolution fails with {{Cannot resolve host}}. This 
ultimately causes the client to consume an entire CPU core on connection 
attempts and failure logging.

As an aside, the JMS client should back off for some time after a {{Cannot 
resolve host}} exception instead of retrying immediately. The retry parameters 
{{retryInterval}}, {{retryIntervalMultiplier}}, {{maxRetryInterval}}, and 
{{reconnectAttempts}} appear to have no effect in this case.
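For illustration, these parameters would be set via the client’s connection URL roughly as follows (the host name is hypothetical; in a real client the resulting URL is passed to {{ActiveMQConnectionFactory}}):

```java
public class RetryUrlExample {
    // Builds a client connection URL carrying the Artemis retry options
    // mentioned above. The host name is hypothetical; in practice the
    // resulting URL is handed to ActiveMQConnectionFactory.
    static String brokerUrl(String host, int port) {
        return "tcp://" + host + ":" + port
                + "?retryInterval=2000"          // ms before the first retry
                + "&retryIntervalMultiplier=2.0" // exponential backoff factor
                + "&maxRetryInterval=30000"      // backoff cap in ms
                + "&reconnectAttempts=10";       // stop after 10 attempts
    }

    public static void main(String[] args) {
        System.out.println(brokerUrl("artemis-primary-0", 61616));
    }
}
```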

[~brusdev] commented:
{quote}logs confirm no messages from replicas so the issue isn't caused by 
jgroups, it could be due to a bug on propagating cluster topology updates. The 
cluster topology updates are sent using ClusterTopologyChangeMessage.
{quote}
Please see [https://stackoverflow.com/q/70288344/6529100] for additional 
information, logs, and configuration.

It may be worth mentioning that shutting down a primary (master) instance 
works as expected.



> Broker does not discard absent replica instances
> ------------------------------------------------
>
>                 Key: ARTEMIS-3606
>                 URL: https://issues.apache.org/jira/browse/ARTEMIS-3606
>             Project: ActiveMQ Artemis
>          Issue Type: Bug
>          Components: Broker
>    Affects Versions: 2.19.0
>            Reporter: Stephan Austermühle
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
