I am unsure if this should actually be filed as a bug, at least with the scope 
that I described. I’ve tested this on the good cluster. The results so far are:

When node 01 is both the rabbitmq-master and active resource manager, and the 
VM is paused, everything goes down.

When node 01 is the rabbitmq-master, but node 02 contains the active resource 
manager, and node 02 is paused, the other two resource managers become active, 
but all the workers are down. This does not auto-resolve when the node is 
unpaused; instead, it creates a separate rabbitmq partition.
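
For reference, this is roughly how I'd check whether the broker actually reports a partition after unpausing (a minimal sketch, run on any rabbitmq node; the restart step assumes the default "ignore" partition handling):

  # The "partitions" section of the cluster status should be empty;
  # a partition that does not heal on its own shows up here.
  rabbitmqctl cluster_status

  # A node stuck in its own partition typically has to rejoin manually,
  # e.g. by restarting its broker:
  systemctl restart rabbitmq-server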

The only way I can imagine the initially described behavior occurring is if the 
hypervisor is rapidly pausing and unpausing the VMs. That way, the rabbitmq and 
mongodb clusters could both stay intact, while the workers and resource manager 
slowly miss their heartbeats.
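
One way to test that theory would be to correlate the hypervisor's pause windows with the heartbeat misses in the logs, roughly like this (the log path and the time range are just placeholders; adjust them for your systems):

  # Collect all "missed heartbeat" events from the pulp/celery processes on a node
  grep -i 'missed heartbeat' /var/log/messages

  # or, on journald-based systems, something like:
  journalctl --since "2018-01-10" | grep -i 'missed heartbeat'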



________________________________
Sebastian Sonne
Systems & Applications (OSA)

noris network AG
Thomas-Mann-Strasse 16-20
90471 Nürnberg
Germany

Tel +49 911 9352 1184
Fax +49 911 9352 100

[email protected]

https://www.noris.de - Mehr Leistung als Standard
Management Board: Ingo Kraupa (Chairman), Joachim Astel, Jürgen Städing
Chairman of the Supervisory Board: Stefan Schnabel - AG Nürnberg HRB 17689

On 10.01.2018 at 14:25, Dennis Kliban <[email protected]> wrote:

It sounds like you may be experiencing issue https://pulp.plan.io/issues/3135

From our conversation on IRC, I learned that the hypervisor is acting up and 
the VMs pause from time to time. So even though the system is not under heavy 
load, it still behaves as though it were. As a result, the inactive resource 
managers think that the active resource manager has become inactive and take 
over as active. What I am still not clear on is why more than one resource 
manager is able to be active at a time. If this is actually happening, then this 
is a new bug. You could avoid this problem by only running 2 resource managers, 
though it would be good to find a reliable way to reproduce this problem and 
file a bug.
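
If you want to verify whether more than one resource manager is really considered active at the same time, something like this should show it (a rough sketch; the hostname is a placeholder and the database/collection names are from memory, so double check them):

  # What Pulp itself believes: the status API lists the workers it currently knows about
  curl -k https://pulp.example.com/pulp/api/v2/status/ | python -m json.tool

  # Or ask MongoDB directly which workers (including the resource manager)
  # have recently updated their heartbeat records:
  mongo pulp_database --eval 'db.workers.find().forEach(printjson)'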

On Wed, Jan 10, 2018 at 6:37 AM, Sebastian Sonne <[email protected]> wrote:
Hello everyone.

I have two pulp clusters, each containing three nodes; all systems are up to 
date (pulp 2.14.3). However, the behavior of the two clusters differs greatly. 
Let's call the working cluster the external one and the broken one the internal one.

The setup: Everything is virtualized. Both clusters are distributed over two 
datacenters, but they're on different ESX clusters. All nodes are allowed to 
migrate between hypervisors.

On the external cluster, "celery status" gives me one resource manager; on the 
internal cluster, I get either two or three resource managers. As far as I 
understand, I can run the resource manager on all nodes but should only see 
one in celery, because the other two nodes are in standby.

Running "ps fauxwww |grep resource_manage[r]" on the external cluster gives me 
four processes in the whole cluster. The currently active resource manager has 
two processes, the other ones have one process each. However, on the internal 
cluster I get six processes, two on each node.

From my understanding, the external cluster works correctly: the active 
resource manager has one process to communicate with celery and one to do the 
work, while the other two nodes each have only one process, which communicates 
with celery and takes over in case the currently active resource manager goes 
down.
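
For what it's worth, this is roughly how I count the resource manager processes per node (a sketch; the hostnames are placeholders for the three nodes of one cluster):

  # The node with the active resource manager should report two processes,
  # the standby nodes one each.
  for host in pulp01 pulp02 pulp03; do
      echo -n "$host: "
      ssh "$host" "ps fauxwww | grep -c 'resource_manage[r]'"
  done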

Oddly enough, celery also seems to disconnect its own workers:

"Jan 10 08:52:36 pulp02 pulp[101629]: celery.worker.consumer:INFO: missed 
heartbeat from reserved_resource_worker-1@pulp02"

As such, I think we can rule out the network.

I'm completely stumped and don't even have a clue which logs I could provide 
or where to start looking.

Grateful for any help,
Sebastian

_______________________________________________
Pulp-list mailing list
[email protected]
https://www.redhat.com/mailman/listinfo/pulp-list
