Hi,

On 07/06/2017 03:14 PM, Uwe Sauter wrote:
Hi Thomas,

thank you for your insight.


1) I was wondering how a PVE (4.4) cluster will behave when one of the nodes is
restarted / shut down, either via the web GUI or via the command line. Will hosted,
HA-managed VMs be migrated to other hosts before shutting down, or will they be
stopped (and restarted on another host once HA recognizes them as gone)?
First: on any graceful shutdown, which triggers stopping the pve-ha-lrm service,
all HA managed services will be queued to stop (graceful shutdown with timeout).
This is done to ensure consistency.

Whether an HA service then gets recovered to another node, or "waits" until the
current node comes up again, depends on whether you triggered a shutdown or a reboot.
On a shutdown the service will be recovered after the node is seen as "dead"
(~2 minutes), but on a reboot we mark the service as frozen, so the HA stack does
not touch it.
The idea here is that if a user reboots the node without migrating a service away,
he expects that the node comes up again quickly and starts the service on its own.
Now, we know that this may not always be ideal, especially on really big machines
with hundreds of gigabytes of RAM and slow-as-hell firmware, where a boot may
need > 10 minutes.
Understood. This is also kind of what I expected.

What is still unclear to me is what you consider a "graceful" shutdown. Every
action that stops pve-ha-lrm?

No, not every action which stops the pve-ha-lrm.
When it gets a stop request from anyone, we check whether a shutdown or reboot is
in progress; if so, we know that we have to stop/shut down the services. If no
shutdown or reboot is in progress, we just freeze the services and do not touch
them. This is done because the only case where this happens is when a user manually
triggers a stop via:
# systemctl stop pve-ha-lrm
or
# systemctl restart pve-ha-lrm
In both cases stopping running services is probably unwanted; we expect that the user knows why he is doing this. One reason could be to shut down the LRM watchdog connection because quorum loss is expected in the next few minutes.

An idea is to make this behavior configurable and to add two additional
behaviors, i.e. migrate away and relocate away.
What's the difference between migration and relocation? Temporary vs. permanent?

Migration does an online migration if possible (i.e. for VMs) and if the service is
already running.
Relocation *always* stops the service if it is running and only then migrates it.
Whether it then gets started again on the other side depends on the request state.

The latter may be useful on really big VMs where a short downtime can be accepted
and an online migration would take far too long or cause congestion on the network.
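
For reference, both are also exposed on the command line (service ID and target
node here are just placeholders):

# ha-manager migrate vm:100 node2
# ha-manager relocate vm:100 node2

The first does an online migration where possible, the second always stops the
service, moves it and (depending on the request state) starts it again.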

2) Currently I run a cluster of four nodes that share the same 2U chassis:

+-----+-----+
|  A  |  B  |
+-----+-----+
|  C  |  D  |
+-----+-----+

(Please don't comment on whether this setup is ideal – I'm aware of the risks a 
single chassis brings…)
As long as nodes share a continent you're never safe anyway :-P
True, but impossible to implement for approx. 99.999999% of all PVE users. And 
latencies will be a nightmare then, esp. with Ceph :D

Haha, yeah, it would be quite a nightmare if you don't have your own sea cable connection :D

I created several HA groups:

- left  contains A & C
- right contains B & D
- upper contains A & B
- lower contains C & D
- all   contains all nodes

and configured VMs to run inside one of the groups.
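
For reference, I created the groups roughly like this via the ha-manager CLI (from
memory, the exact option names may differ slightly on 4.4):

# ha-manager groupadd left  --nodes A,C
# ha-manager groupadd right --nodes B,D
# ha-manager groupadd upper --nodes A,B
# ha-manager groupadd lower --nodes C,D
# ha-manager groupadd all   --nodes A,B,C,D

and each VM is then assigned to its group, e.g.:

# ha-manager set vm:101 --group left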

For updates I usually follow these steps:
- migrate the VMs off the node via the "bulk migrate" feature, selecting one of the
other nodes
- when no more VMs run, do an "apt-get dist-upgrade" and reboot
- repeat until all nodes are up-to-date

One issue I ran into with this procedure is that sometimes, while a VM is still
being migrated to another host, already migrated VMs are migrated back onto the
current node, because the target that was selected for "bulk migrate" was not
inside the same group as the current host.
This is expected: you told the ha-manager that a service should not or cannot run
there, so it tried to bring it back into an "OK" state.
Yes, I was aware of the reasons why the VM was moved back, though it would make 
more sense to move it to another node in the same
(allowed) group for the maintenance case I'm describing here.

Practical example:
- VM 101 is configured to run on the left side of the cluster
- VM 102 is configured to run on the lower level of the cluster
- node C shall be updated
- I select "bulk migrate" to node D
- VM 101 is migrated to D
- VM 102 is migrated to D, but takes some time (a lot of RAM)
- HA recognizes that VM 101 is not running in the correct group and schedules a 
migration back to node C
- migration of VM 102 finishes and migration of VM 101 back to node C
immediately starts
- once migration of VM 101 has finished I manually need to initiate another
migration (and after that I need to be faster than HA to do a reboot)
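
Done by hand, a group-aware evacuation of node C would boil down to something like
this (ha-manager CLI, VM IDs and node names as in the example above):

# ha-manager migrate vm:101 A
# ha-manager migrate vm:102 D

so that every VM stays inside its allowed group.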


Would it be possible to implement another "bulk action" that evacuates a host in
such a way that, for every VM, the appropriate target node is selected depending on
the HA group configuration? This might also temporarily disable that node in HA
management, e.g. for 10 min or until the next reboot, so that maintenance work can
be done…
What do you think of that idea?

Essentially a maintenance mode? I'm not opposed to it, but if such a thing were
done it would only be a light wrapper around already existing functionality.
Absolutely. Just another action that would evacuate the current host as optimally
as possible. All VMs that are constrained to a specific node group should be
migrated within that group, all other VMs should be migrated to any node available
(possibly doing some load balancing inside the cluster).

I'll look into this again; if I get an idea how to incorporate it without breaking
edge cases I can give it a shot. No promises yet, though, sorry :)

Can I ask what the reason for your group setup is?
I assume that all VMs may run on all nodes, but you want to "pin" some VMs to 
specific nodes for load reasons?
We started to build a cluster out of just one chassis with four nodes. In the
next few weeks I will add additional nodes that will possibly be located in another
building. Those nodes will be grouped similarly and there will be additional groups
that include subsets of nodes from each building.

The reason behind my group setup is that I have two projects, each with several
services that run on two VMs each (for redundancy and load balancing, e.g. LDAP).
A configuration where one LDAP runs "left" and the other runs "right" eliminates
the risk that both VMs end up on the same node (and that the service is disrupted
if that particular node fails).
So for the first project I distribute all important VMs between "left" and
"right", and the other project's important VMs are distributed between "upper" and
"lower". This ensures that, for both projects, important services are not
interrupted if *one* node fails.
All less-important VMs are allowed to run on all nodes.

If there are valid concerns against this reasoning, I'm open to suggestions for 
improvement.

Sounds OK; I have to think about whether I can propose a better-fitting solution
with our HA stack. An idea was to add simple dependencies, i.e. this group/service
should not run on the same node as that other group/service. Not sure whether this
is too specialized or whether more people would profit from it...

If this is the case I'd suggest changing the group configuration, i.e. each node
gets its own group: A, B, C and D. Each group has the respective node with
priority 2 and all other nodes with priority 1.
When doing a system upgrade on node A you would edit group A and set node A's
priority to 0; now all services should migrate away from this node, trying to
balance the service count over all other nodes.
You do not need to trigger a bulk action, at least for the HA-managed VMs.

After everything has migrated, execute the upgrade and reboot.
Then reconfigure group A so that node A has the highest priority again,
i.e. 2, and the respective services migrate back to it.

This should be quite fast to do after the initial setup; you just need to open
the group configuration dialog and lower/raise the priority of one node.
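
On the command line the whole cycle would look roughly like this (group and node
names as above, vm:100 just an example service; check "man ha-manager" for the
exact --nodes syntax):

# ha-manager groupadd A --nodes "A:2,B:1,C:1,D:1"
# ha-manager set vm:100 --group A

Before upgrading node A, make it the least preferred node:

# ha-manager groupset A --nodes "A:0,B:1,C:1,D:1"

and after the reboot restore the old priority so the services migrate back:

# ha-manager groupset A --nodes "A:2,B:1,C:1,D:1"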

You could also use a similar procedure with your current group configuration.
The main thing that changes is that you need to edit two groups to free up a node.
The advantage of my method would be that the services get distributed over all
other nodes, not just moved to a single one.
Interesting idea. I haven't had a look at priorities yet.

Request for improvement: In "datacenter -> HA -> groups" show the configured 
priority, e.g. in a format
"nodename(priority)[,nodename(priority)]"

Hmm, this should already be the case, except when the default priority is set.
I added this when I reworked the HA group editor sometime in 4.3.

cheers,
Thomas

