Re: Kernel Panic after online migration from SLES12 SP3 to SP4

Gerald Schaefer Tue, 11 Jun 2019 06:33:19 -0700

On Mon, 10 Jun 2019 17:47:20 +0000
Mark Bullis <[email protected]> wrote:


> We attempted to upgrade 4 SUSE linux  z/VM 6.4 guests this last weekend and 2 
> of them failed with
>
> [FAILED] Failed to start Load Kernel Modules
>
> Many Out of Memory messages
>
> Kernel panic - not syncing:  Out of memory and no killable processes
>
> CPU:  1 PID:250 Comm kworker/u128:2 Not tainted 4.4.131-94.29-default #1
>
> Then CP entered; disable wait.....
>
> The upgrade was successful, but got the above after the first boot.
>
> The 2 upgrades that succeeded use kernel 4.12.14-95.16-default.  Had to 
> restore the failed systems from backups.
>
> Opened an S.R. with SUSE, and the engineer told me to upgrade again, but 
> before rebooting check grub and make sure it selects the right kernel.
> Has anyone else seen this?  The 2 failed guests are both 20GiB oracle 
> database servers with hugepages turned on.  The 2 that succeeded do not use 
> hugepages and are only 6GiB in size.

This has been reported via LTC bug#175823 and SUSE bug#1127293.

It is because of the special mechanism that is used for grub2 on SLES. The 
first kernel (stage 1) gets booted via zipl and it will present the grub menu 
and load the second kernel (stage 2) via kexec. For historic reasons, the stage 
1 kernel is always booted with mem=1G parameter appended. If you now configure 
hugepages on your system, e.g. via sysctl.conf, that setting will also be 
propagated to the stage 1 kernel initrd, as soon as it is being rebuilt.

Normal kernel maintenance updates do not rebuild the stage 1 kernel and its 
initrd, they only change the stage 2 kernel. However, during the SP3 -> SP4 
update, the stage 1 kernel and initrd apparently are rebuilt, resulting in a 
stage 1 kernel with memory restricted to 1 GB trying to allocate tons of 
hugepages and going out-of-memory before it can do the kexec for the stage 2 
kernel.
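
To put rough numbers on this, here is a minimal sketch that compares a 
hugepage reservation against a memory limit. The helper name hugepages_fit 
and the value of 8192 hugepages are made up for illustration, and the 1 MiB 
hugepage size is an assumption about the s390x default (check Hugepagesize in 
/proc/meminfo on your system).

```shell
#!/bin/sh
# Hypothetical helper, not part of any SUSE tooling.
# Arguments: memory limit in MiB, vm.nr_hugepages value, hugepage size in MiB.
# Prints both figures; returns 0 only if the reservation is below the limit.
hugepages_fit() {
    limit_mib=$1
    nr_hugepages=$2
    page_mib=$3
    wanted_mib=$((nr_hugepages * page_mib))
    echo "reservation: ${wanted_mib} MiB, limit: ${limit_mib} MiB"
    [ "$wanted_mib" -lt "$limit_mib" ]
}
```

Calling hugepages_fit 1024 8192 1 (stage 1 kernel with mem=1G, assumed 8192 
hugepages of 1 MiB) prints "reservation: 8192 MiB, limit: 1024 MiB" and 
returns non-zero, while the same reservation against the full 20 GiB 
(hugepages_fit 20480 8192 1) returns zero.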

There are two options to fix it: either remove the hugepages pre-allocation 
setting before the SP3 -> SP4 update, or remove the "mem=1G" parameter for the 
stage 1 kernel in /etc/default/zipl2grub.conf.in. The latter was chosen by SUSE 
to resolve the bugzilla, by providing a grub2 PTF rpm to the customer. If you 
already have an S.R. with SUSE, you could point them to SUSE bug#1127293.

In order to get into the system after the upgrade, you can try to skip the 
normal stage 1/2 mechanism with the (hidden) zipl boot menu in SLES:
- IPL with LOADPARM 2 (=> skip-grub) to side-step 'mem=1G' (this will boot up 
the system with the stage 1 kernel only, but w/o mem=1G),
- after (presumably) successful boot log on as root,
- remove "mem=1G" from '/etc/default/zipl2grub.conf.in'
- run 'grub2-install --force' to establish the new kernel in '/boot/zipl'
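
The last two steps could be scripted roughly as below. This is only a sketch: 
the function name strip_mem_limit is hypothetical, it makes no assumption 
about the layout of the file beyond "mem=1G" appearing as a literal token, and 
it keeps a .bak copy so the change can be reviewed or reverted. It would also 
hit a longer option that merely ends in "mem=1G", so check the result before 
running grub2-install.

```shell
#!/bin/sh
# Hypothetical helper: drop the literal token "mem=1G" from a config template.
# $1: path to the template (on SLES: /etc/default/zipl2grub.conf.in)
strip_mem_limit() {
    conf=$1
    # Keep a backup next to the original before changing anything.
    cp -p "$conf" "$conf.bak" || return 1
    # Delete the first "mem=1G" on each line, plus one preceding blank if any.
    # Writing back via redirection keeps the original file's permissions.
    sed 's/ \{0,1\}mem=1G//' "$conf.bak" > "$conf"
}

# On the affected system, as root (deliberately not executed by this sketch):
#   strip_mem_limit /etc/default/zipl2grub.conf.in
#   diff /etc/default/zipl2grub.conf.in.bak /etc/default/zipl2grub.conf.in
#   grub2-install --force
```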

Regards,
Gerald Schaefer

----------------------------------------------------------------------
For LINUX-390 subscribe / signoff / archive access instructions,
send email to [email protected] with the message: INFO LINUX-390 or visit
http://www2.marist.edu/htbin/wlvindex?LINUX-390
