On Mon, 10 Jun 2019 17:47:20 +0000
Mark Bullis <[email protected]> wrote:
> We attempted to upgrade 4 SUSE linux z/VM 6.4 guests this last weekend and 2
> of them failed with
>
> [FAILED] Failed to start Load Kernel Modules
>
> Many Out of Memory messages
>
> Kernel panic - not syncing: Out of memory and no killable processes
>
> CPU: 1 PID:250 Comm kworker/u128:2 Not tainted 4.4.131-94.29-default #1
>
> Then CP entered; disable wait.....
>
> The upgrade was successful, but got the above after the first boot.
>
> The 2 upgrades that succeeded use kernel 4.12.14-95.16-default. Had to
> restore the failed systems from backups.
>
> Opened an S.R. with SUSE, and the engineer told me to upgrade again, but
> before rebooting check grub and make sure it selects the right kernel.
> Has anyone else seen this? The 2 failed guests are both 20GiB oracle
> database servers with hugepages turned on. The 2 that succeeded do not use
> hugepages and are only 6GiB in size.

This has been reported via LTC bug#175823 and SUSE bug#1127293.

It is because of the special mechanism that is used for grub2 on SLES. The
first kernel (stage 1) gets booted via zipl and it will present the grub menu
and load the second kernel (stage 2) via kexec. For historic reasons, the stage
1 kernel is always booted with mem=1G parameter appended. If you now configure
hugepages on your system, e.g. via sysctl.conf, that setting will also be
propagated to the stage 1 kernel initrd, as soon as it is being rebuilt.

Normal kernel maintenance updates do not rebuild the stage 1 kernel and its
initrd; they only change the stage 2 kernel. However, during the SP3 -> SP4
update, the stage 1 kernel and initrd apparently are rebuilt, resulting in a stage 1
kernel with restricted 1 GB memory trying to allocate tons of hugepages and
going out-of-memory before it can do the kexec for the stage 2 kernel.
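
To make the arithmetic concrete, here is a rough sketch of the check the stage 1 kernel effectively fails. The helper name hugepages_fit, the 1 MiB huge page size and the page count in the example are my assumptions for illustration; they are not taken from the bug reports.

```shell
#!/bin/sh
# Sketch only: would a huge page pre-allocation fit under a mem= limit?
# Assumptions (not from the bug reports): 1 MiB huge pages and a
# hypothetical helper name. The page count below is a made-up example.

# hugepages_fit NR_HUGEPAGES HUGEPAGE_KIB LIMIT_MIB
hugepages_fit() {
    nr=$1
    page_kib=$2
    limit_mib=$3
    # Memory the pre-allocation would pin, in MiB.
    need_mib=$(( nr * page_kib / 1024 ))
    if [ "$need_mib" -lt "$limit_mib" ]; then
        echo "ok: ${need_mib} MiB of huge pages, limit ${limit_mib} MiB"
    else
        echo "too big: ${need_mib} MiB of huge pages, limit ${limit_mib} MiB"
        return 1
    fi
}

# Example: 16384 huge pages of 1024 KiB against the 1G (1024 MiB) limit.
hugepages_fit 16384 1024 1024 || true
```

On a running guest the first two arguments could be taken from /proc/sys/vm/nr_hugepages and the Hugepagesize line in /proc/meminfo. Even a pre-allocation somewhat below the limit will likely fail in practice, because the kernel and initrd need memory too.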

There are two options to fix it: either remove the hugepages pre-allocation
setting before the SP3 -> SP4 update, or remove the "mem=1G" parameter for the
stage 1 kernel in /etc/default/zipl2grub.conf.in. The latter was chosen by SUSE
to resolve the bugzilla, by providing a grub2 PTF rpm to the customer. If you
already have an S.R. with SUSE, you could point them to SUSE bug#1127293.

In order to get into the system after the upgrade, you can try to skip the
normal stage 1/2 mechanism with the (hidden) zipl boot menu in SLES:

- IPL with LOADPARM 2 (=> skip-grub) to side-step 'mem=1G' (this will boot up
  the system with the stage 1 kernel only, but w/o mem=1G),
- after (presumably) successful boot, log on as root,
- remove "mem=1G" from '/etc/default/zipl2grub.conf.in',
- run 'grub2-install --force' to establish the new kernel in '/boot/zipl'.

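
The "remove mem=1G" step could be scripted roughly as below. This is an untested sketch: the helper name strip_mem_limit and the sample parameters line are made up for the demo, and the real template on your system may look different, so please check the result by hand before relying on it.

```shell
#!/bin/sh
# Sketch only: drop a "mem=1G" token from a file, keeping a .bak copy.
# strip_mem_limit is a hypothetical helper name, not part of any SUSE tool.
strip_mem_limit() {
    file=$1
    # Keep the original so the change can be reviewed or reverted.
    cp -p "$file" "$file.bak" || return 1
    # Remove mem=1G only when it is a whole token (bounded by whitespace,
    # a closing double quote, or the start/end of the line).
    sed -E 's/(^|[[:space:]])mem=1G([[:space:]"]|$)/\2/g' "$file.bak" > "$file"
}

# Demo on a scratch file; this parameters line is an invented sample,
# not the real content of zipl2grub.conf.in.
demo=$(mktemp)
printf '%s\n' 'parameters = "root=/dev/dasda1 quiet mem=1G"' > "$demo"
strip_mem_limit "$demo"
cat "$demo"
```

On the real system the function would be pointed at /etc/default/zipl2grub.conf.in as root, the .bak copy compared against the result, and then 'grub2-install --force' run as described in the list above.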
Regards,
Gerald Schaefer