Hi,
On 07/06/2017 03:14 PM, Uwe Sauter wrote:
Hi Thomas,
thank you for your insight.
1) I was wondering how a PVE (4.4) cluster will behave when one of the nodes is
restarted / shutdown either via WebGUI or via
commandline. Will hosted, HA-managed VMs be migrated to other hosts before
shutting down, or will they be stopped (and restarted on
another host once HA recognizes them as gone)?
First: on any graceful shutdown, which triggers stopping the pve-ha-lrm service,
all HA-managed services will be queued to stop (graceful shutdown with timeout).
This is done to ensure consistency.
Whether an HA service then gets recovered to another node or "waits" until the
current node comes up again depends on whether you triggered a shutdown or a
reboot.
On a shutdown the service will be recovered after the node is seen as "dead"
(~2 minutes), but on a reboot we mark the service as frozen, so the HA stack
does not touch it.
The idea here is that if a user reboots the node without migrating a service
away, he expects the node to come up again quickly and start the service on
its own.
Now, we know that this may not always be ideal, especially on really big
machines with hundreds of gigabytes of RAM and painfully slow firmware, where
a boot may need > 10 minutes.
Understood. This is also kind of what I expected.
What is still unclear to me is what you consider a "graceful" shutdown. Every
action that stops pve-ha-lrm?
No, not every action which stops the pve-ha-lrm.
If it gets a stop request from anyone, we check whether a shutdown or reboot is
in progress; if so, we know that we have to stop/shut down the services.
If no shutdown or reboot is in progress, we just freeze the services and
do not touch them. This is done because the only case where this happens is
the one where a user manually triggers a stop via:
# systemctl stop pve-ha-lrm
or
# systemctl restart pve-ha-lrm
in both cases stopping running services is probably unwanted; we expect
that the user knows why he is doing this.
One reason could be to shut down the LRM watchdog connection because quorum
loss is expected in the next few minutes.
An idea is to make this behavior configurable and add two additional
behaviors, i.e. migrate away and relocate away.
What's the difference between migration and relocation? Temporary vs. permanent?
Migration does an online migration if possible (i.e. for VMs) and if the
service is already running.
Relocation *always* stops the service if it is running and only then migrates
it. Whether it then gets started again on the other side depends on the
request state.
The latter may be useful for really big VMs where a short downtime is
acceptable and an online migration would take far too long or cause congestion
on the network.
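For reference, both actions can also be triggered from the command line via
ha-manager; a quick sketch, where the service ID vm:101 and the target node
name nodeB are placeholders:

```
# online migration where possible; the VM keeps running during the move
ha-manager migrate vm:101 nodeB

# relocation: stop the service first, then move it and (depending on the
# request state) start it again on the target node
ha-manager relocate vm:101 nodeB
```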
2) Currently I run a cluster of four nodes that share the same 2U chassis:
+-----+-----+
| A | B |
+-----+-----+
| C | D |
+-----+-----+
(Please don't comment on whether this setup is ideal – I'm aware of the risks a
single chassis brings…)
As long as your nodes share a continent you're never safe anyway :-P
True, but impossible to implement for approx. 99.999999% of all PVE users. And
latencies will be a nightmare then, esp. with Ceph :D
Haha, yeah, it would be quite a nightmare if you don't have your own sea
cable connection :D
I created several HA groups:
- left contains A & C
- right contains B & D
- upper contains A & B
- lower contains C & D
- all contains all nodes
and configured VMs to run inside one of the groups.
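For illustration, a group setup like the one above corresponds roughly to the
following fragment of /etc/pve/ha/groups.cfg (A-D stand in for the real node
names):

```
group: left
        nodes A,C

group: right
        nodes B,D

group: upper
        nodes A,B

group: lower
        nodes C,D

group: all
        nodes A,B,C,D
```

Whether each group additionally needs the "restricted" flag depends on
whether a service may temporarily run outside its group.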
For updates I usually follow the following steps:
- migrate VMs from node via "bulk migrate" feature, selecting one of the other
nodes
- when no more VMs run, do an "apt-get dist-upgrade" and reboot
- repeat till all nodes are up-to-date
One issue I ran into with this procedure is that sometimes, while a VM is
still being migrated to another host, already migrated VMs are migrated back
onto the current node, because the target that was selected for "bulk migrate"
was not inside the same group as the current host.
This is expected: you told the ha-manager that a service should not or cannot
run there, thus it tried to bring it into an "OK" state again.
Yes, I was aware of the reasons why the VM was moved back, though it would make
more sense to move it to another node in the same
(allowed) group for the maintenance case I'm describing here.
Practical example:
- VM 101 is configured to run on the left side of the cluster
- VM 102 is configured to run on the lower level of the cluster
- node C shall be updated
- I select "bulk migrate" to node D
- VM 101 is migrated to D
- VM 102 is migrated to D, but takes some time (a lot of RAM)
- HA recognizes that VM 101 is not running in the correct group and schedules a
migration back to node C
- migration of VM 102 finishes and migration of VM 101 back to node C
immediately starts
- once migration of VM 101 has finished I manually need to initiate another
migration (and after that need to be faster than HA to do a reboot)
Would it be possible to implement another "bulk action" that evacuates a host
in such a way that for every VM the appropriate target node is selected,
depending on the HA group configuration? This might also temporarily disable
that node in HA management, e.g. for 10 min or until the next reboot, so that
maintenance work can be done…
What do you think of that idea?
Quasi, a maintenance mode? I'm not opposed to it, but if such a thing were
done it would only be a light wrapper around already existing functionality.
Absolutely. Just another action that would evacuate the current host as
optimally as possible. All VMs that are constrained to a specific node group
should be migrated within that group; all other VMs should be migrated to any
available node (possibly doing some load balancing inside the cluster).
I'll look into this again; if I get an idea how to incorporate it without
breaking edge cases I can give it a shot.
No promises yet, though, sorry :)
Can I ask what the reason for your group setup is?
I assume that all VMs may run on all nodes, but you want to "pin" some VMs to
specific nodes for load reasons?
We started to build a cluster out of just one chassis with four nodes. In the
next few weeks I will add additional nodes that will possibly be located in
another building. Those nodes will be grouped similarly, and there will be
additional groups that include subsets of nodes from each building.
The reason behind my group setup is that I have two projects, each with
several services that run on two VMs (for redundancy and load balancing, e.g.
LDAP). A configuration where one LDAP runs "left" and the other runs "right"
eliminates the risk that both VMs run on the same node (and of a disruption
of service if that particular node fails).
So for the first project I distribute all important VMs between "left" and
"right", and the other project's important VMs are distributed between "upper"
and "lower". This ensures that for both projects, important services are not
interrupted if *one* node fails.
All less-important VMs are allowed to run on all nodes.
If there are valid concerns against this reasoning, I'm open to suggestions for
improvement.
Sounds OK; I have to think about whether I can propose a better fitting
solution regarding our HA stack.
An idea was to add simple dependencies, i.e. this group/service should not
run on the same node as that other group/service. Not sure if this is quite
a special case or if more people would profit from it...
If this is the case I'd suggest changing the group configuration,
i.e. each node gets a group: A, B, C and D. Each group has the respective node
with priority 2 and all others with priority 1.
When doing a system upgrade on node A you would edit group A and set node A's
priority to 0; all services should then migrate away from this node, trying
to balance the service count over all nodes.
You do not need to trigger a bulk action, at least for the HA-managed VMs.
After everything has migrated, execute the upgrade and reboot.
Then reconfigure group A so that node A again has the highest priority,
i.e. 2, and the respective services migrate back to it.
This should be quite fast to do after the initial setup; you just need to
open the group configuration dialog and lower/raise the priority of one node.
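A sketch of what such per-node groups could look like in
/etc/pve/ha/groups.cfg (placeholder node names; priorities use the
node:priority syntax):

```
group: groupA
        nodes A:2,B:1,C:1,D:1

group: groupB
        nodes A:1,B:2,C:1,D:1
```

To drain node A you would change "A:2" to "A:0" (or lower the priority in the
group edit dialog) and restore it after the reboot.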
You could also use a similar procedure with your current group configuration.
The main thing that changes is that you need to edit two groups to free a
node. The advantage of my method is that the services get distributed over
all other nodes, not just moved to a single one.
Interesting idea. Didn't have a look at priorities yet.
Request for improvement: In "datacenter -> HA -> groups" show the configured
priority, e.g. in a format
"nodename(priority)[,nodename(priority)]"
Hmm, this should already be the case, except when the default priority is set.
I added this when I reworked the HA group editor sometime in 4.3.
cheers,
Thomas
_______________________________________________
pve-user mailing list
pve-user@pve.proxmox.com
https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user