Update: We seem to have found the issue. Infrastructure told me there is a problem that can pause the VMs for anywhere from nanoseconds to several seconds, possibly hundreds of times in a row with only fractions of a second between the pauses. So if the active resource manager pauses, a standby takes over; when the paused manager comes back, we end up with two active managers.
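For illustration only, here is a minimal sketch of how an active resource manager could notice that another instance took over during such a pause, using a heartbeat record in MongoDB. This is not Pulp's actual implementation; the "active_managers" collection, the field names, the database name, and the 30-second lease window are all assumptions made up for the sketch.

    # Hypothetical sketch, not Pulp's real code: before doing work, the
    # active manager checks for a fresher heartbeat from another manager
    # and steps down if it finds one, instead of running in parallel.
    from datetime import datetime, timedelta
    from pymongo import MongoClient

    HEARTBEAT_WINDOW = timedelta(seconds=30)  # assumed lease length

    def should_stay_active(db, my_name):
        now = datetime.utcnow()
        # Is there another manager whose heartbeat is still fresh?
        other = db.active_managers.find_one({
            'name': {'$ne': my_name},
            'last_heartbeat': {'$gte': now - HEARTBEAT_WINDOW},
        })
        if other is not None:
            # A standby became active while we were paused; step down.
            return False
        # Otherwise refresh our own heartbeat record and carry on.
        db.active_managers.update_one(
            {'name': my_name},
            {'$set': {'last_heartbeat': now}},
            upsert=True,
        )
        return True

    # Example usage (database name is an assumption for this sketch):
    db = MongoClient().pulp_database
    if not should_stay_active(db, 'resource_manager@pulp01'):
        raise SystemExit('another resource manager is active, shutting down')

The point is simply that a manager coming back from a long pause would see the fresher heartbeat of the standby that took over and shut itself down rather than run alongside it.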
By that point, the only bug that is actually Pulp-related is that an active resource manager doesn't check whether another manager has already become active.

Regards,
Sebastian

> On 10.01.2018 at 12:37, Sebastian Sonne <[email protected]> wrote:
>
> Hello everyone,
>
> I have two Pulp clusters, each containing three nodes; all systems are up to
> date (Pulp 2.14.3). However, the cluster behavior differs greatly. Let's call
> the working cluster the external one and the broken one the internal one.
>
> The setup: everything is virtualized. Both clusters are distributed over two
> datacenters, but they're on different ESX clusters. All nodes are allowed to
> migrate between hypervisors.
>
> On the external cluster, "celery status" gives me one resource manager; on
> the internal cluster I get either two or three resource managers. As far as I
> understand, I can run the resource manager on all nodes but should only see
> one in celery, because the other two nodes go into standby.
>
> Running "ps fauxwww | grep resource_manage[r]" on the external cluster gives
> me four processes in the whole cluster. The currently active resource manager
> has two processes, and the other nodes have one process each. On the
> internal cluster, however, I get six processes, two on each node.
>
> From my understanding, the external cluster works correctly: the active
> resource manager has one process to communicate with celery and one to do
> the work, while the other two nodes each have only one process to communicate
> with celery and become active in case the currently active resource manager
> goes down.
>
> Oddly enough, celery also seems to lose contact with its own workers:
>
> "Jan 10 08:52:36 pulp02 pulp[101629]: celery.worker.consumer:INFO: missed
> heartbeat from reserved_resource_worker-1@pulp02"
>
> As such, I think we can eliminate the network.
>
> I'm completely stumped and don't even have a real clue of what logs I could
> provide, or where to start looking into things.
>
> Grateful for any help,
> Sebastian

Sebastian Sonne
Systems & Applications (OSA)
noris network AG
Thomas-Mann-Strasse 16−20
90471 Nürnberg
Deutschland
Tel +49 911 9352 1184
Fax +49 911 9352 100

[email protected]
https://www.noris.de - Mehr Leistung als Standard
Vorstand: Ingo Kraupa (Vorsitzender), Joachim Astel, Jürgen Städing
Vorsitzender des Aufsichtsrats: Stefan Schnabel - AG Nürnberg HRB 17689
