To add to this: instead of issuing a plain reboot, I prefer running
'pcs stonith fence <node>', which fails over resources appropriately
AND reboots the node (if possible) or otherwise powers it off. The advantage
of doing it this way is that Pacemaker stays informed about the
state of the node, so it doesn't (usually) shoot it while it's trying to
boot back up. When you do maintenance on a node without letting
Pacemaker know about it, results can be unpredictable.
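
For illustration, the fencing approach above could be wrapped in a small script along these lines. This is only a sketch: the node name `oss02` and the `run` and `fence_node` helpers are made up for the example, and it assumes a working stonith device is already configured. By default it only prints the commands (dry run).

```shell
#!/bin/sh
# Sketch only. run() and fence_node() are hypothetical helpers, not part of pcs.
# With DRY_RUN=1 (the default) commands are printed instead of executed.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

fence_node() {
    node="$1"
    # Ask Pacemaker to fence the node through its configured stonith device,
    # so the cluster itself knows the node went down and why.
    run pcs stonith fence "$node"
    # Show how the cluster currently sees its nodes.
    run pcs status nodes
}
```

After sourcing the file, `fence_node oss02` prints the two commands; setting `DRY_RUN=0` in the environment makes it actually run them.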
Cameron
On 3/5/25 2:12 PM, Laura Hild via lustre-discuss wrote:
I'm not sure what to say about how Pacemaker *should* behave, but I *can* say I
virtually never try to (cleanly) reboot a host from which I have not already
evacuated all resources, e.g. with `pcs node standby` or by putting Pacemaker
in maintenance mode and unmounting/exporting everything manually. If I can't
evacuate all resources and complete a lustre_rmmod, the host is getting
power-cycled.
So maybe what I can say is this: my guess would be that in the host's shutdown
process, the Pacemaker service is stopped before the filesystems are unmounted,
and that Pacemaker doesn't want to assume whether its own shutdown means it
should go into standby or enter maintenance mode. The other host therefore
ends up knowing only that its partner has disappeared, while the
filesystems have yet to be unmounted.
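
As a rough sketch of the "evacuate first, then reboot" routine described above, something like the following could be run on the node being rebooted. The node name `oss02` and the `run` and `drain_and_reboot` helpers are invented for the example. It also assumes the resources have finished moving by the time `lustre_rmmod` runs. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch only. run() and drain_and_reboot() are hypothetical helpers.
# With DRY_RUN=1 (the default) commands are printed instead of executed.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

drain_and_reboot() {
    node="$1"
    # Move all resources off the node before touching it.
    run pcs node standby "$node" || return 1
    # Unload the Lustre modules; this fails if anything is still mounted.
    if ! run lustre_rmmod; then
        echo "lustre_rmmod failed; skip the clean reboot and power-cycle instead" >&2
        return 1
    fi
    # Only reboot cleanly once the node is fully evacuated.
    run systemctl reboot
}
```

Once the node is back up, `pcs node unstandby <node>` would let it take resources again. That step is not mentioned in the thread and is included here as an assumption.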
_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org