I've verified this LP using AWS instances, so the kernel versions I
tested were -aws flavored; this was necessary because the patch proposed
here touches the ena driver, which manages an AWS-exclusive
virtual NIC.

The versions tested were:

4.4.0.1105 xenial-updates
4.4.0.1106 xenial-proposed
4.15.0.1065 bionic-updates
4.15.0.1066 bionic-proposed
5.3.0.1016 eoan-updates (tested with Bionic-HWE analog)
5.3.0.1017 eoan-proposed (tested with Bionic-HWE analog)

The test was based on the [Test] section / comment #3, and I managed
to reproduce the issue in all versions from -updates, whereas no version
from -proposed showed the issue (20 kexecs succeeded). In the failure
case, the crash on boot (initrd corruption) showed up at the 2nd or, at
most, the 3rd kexec. In Xenial (kernel 4.4), due to the "small" size of
the initrd, I had to install linux-modules-extra to increase the size
of the file and hence expose the memory corruption.

Also, I've checked in all kernels if the symbols added by the patch were
there, with the following command:

# grep "ena_remov\|ena_shut" /proc/kallsyms
ffffffffc0005690 t __ena_shutoff        [ena]
ffffffffc0005770 t ena_shutdown [ena]
ffffffffc0005790 t ena_remove   [ena]

The above output is from a -proposed kernel; kernels in -updates only
show the ena_remove() symbol. Finally, I checked for the patch in the
generic flavor trees for X/B/E/F, and it is present in the latest tag
(corresponding to the kernels in -proposed, and tag Ubuntu-5.4.0-22.26
for Focal).
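
For repeated checks across several instances, the symbol lookup above
can be wrapped in a small helper. This is only a sketch: the function
name has_ena_shutdown and its optional file argument are hypothetical
and not part of the patch or of any existing tool.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the ena_shutdown symbol is listed.
# $1 is an optional kallsyms-format file; it defaults to /proc/kallsyms,
# so saved output from another machine can be checked as well.
has_ena_shutdown() {
    # The symbol name is the third whitespace-separated column.
    awk '$3 == "ena_shutdown" { found = 1 } END { exit !found }' \
        "${1:-/proc/kallsyms}"
}
```

Called with no argument it inspects the running kernel; a non-zero
exit status means the symbol is absent, either because the kernel is
unpatched or because the ena module is not loaded.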

So, I'm hereby marking this LP as verified for all releases. See the
next comment for a note about the Disco kernel.

Thanks,

Guilherme

** Tags removed: verification-needed-bionic verification-needed-eoan 
verification-needed-xenial
** Tags added: verification-done verification-done-bionic 
verification-done-eoan verification-done-xenial

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1869948

Title:
  Multiple Kexec in AWS Nitro instances fail

Status in linux package in Ubuntu:
  Fix Committed
Status in linux source package in Xenial:
  Fix Committed
Status in linux source package in Bionic:
  Fix Committed
Status in linux source package in Disco:
  Opinion
Status in linux source package in Eoan:
  Fix Committed
Status in linux source package in Focal:
  Fix Committed

Bug description:
  [Impact]
  * Currently, users cannot perform multiple kernel kexec loads on AWS
  Nitro instances (KVM-based); after the 2nd or 3rd kexec, an initrd
  corruption is observed, with the following signature:

   Initramfs unpacking failed: junk within compressed archive
  [...]
   Kernel panic - not syncing: No working init found.
  Try passing init= option to kernel.
  See Linux Documentation/admin-guide/init.rst for guidance.
  CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.5.0-rc7-gpiccoli+ #26
  Hardware name: Amazon EC2 t3.large/, BIOS 1.0 10/16/2017
  Call Trace:
    dump_stack+0x6d/0x9a
    ? csum_partial_copy_generic+0x150/0x170
    panic+0x101/0x2e3
    ? do_execve+0x25/0x30
    ? rest_init+0xb0/0xb0
    kernel_init+0xfb/0x100
    ret_from_fork+0x35/0x40

  * After investigation (see comment 2), we noticed that the Amazon ena
  network driver doesn't provide a shutdown() handler, hence the device
  could still be performing a DMA transaction to a previously valid
  address during boot, which would then corrupt kernel memory. The
  following patch was proposed and fixed the issue, allowing 1000 kexecs
  to be executed successfully with no issues observed: 428c491332bc
  ("net: ena: Add PCI shutdown handler to allow safe kexec")
  [ git.kernel.org/linus/428c491332bc ].

  * Hence, we are hereby requesting an SRU for this patch. It was tested
  successfully in all supported series (4.4, 4.15 and 5.3) on Amazon
  Nitro instances, and reviewed/acked by the ena driver team and a kexec
  developer from another distro. It is worth mentioning that we proposed
  an upstream multi-vendor discussion about this issue:
  marc.info/?l=kexec&m=158299605013194

  [Test case]

  * The basic test procedure consists of performing multiple kexecs
  sequentially. AWS does not provide a full console, so in case of
  failure one could check the instance screenshot, or use pstore/ramoops
  to collect the dmesg in a preserved memory area after a crash.
  The commands used to perform kexec are:

  kexec -l <kernel file> --initrd <initrd file> --reuse-cmdline
  systemctl kexec

  Alternatively, one could use "--append=" instead of "--reuse-cmdline"
  if a change in the kexec command line is desired; also, to execute the
  kexec-loaded kernel, both "kexec -e" and "systemctl kexec" are equally
  valid.
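
  As an illustration of the load step, the sketch below builds the
  "kexec -l" command for a given kernel release. It assumes Ubuntu's
  /boot/vmlinuz-<release> and /boot/initrd.img-<release> file naming;
  the function name kexec_load_cmd is hypothetical.

```shell
#!/bin/sh
# Sketch: print the kexec load command for a kernel release.
# $1 is the release string; it defaults to the running kernel (uname -r).
# Assumes the Ubuntu naming of kernel and initrd files under /boot.
kexec_load_cmd() {
    release="${1:-$(uname -r)}"
    printf 'kexec -l /boot/vmlinuz-%s --initrd /boot/initrd.img-%s --reuse-cmdline\n' \
        "$release" "$release"
}
```

  The function only prints the command, so it can be reviewed before
  use; as root, running eval "$(kexec_load_cmd)" followed by
  "systemctl kexec" would load and then boot the running kernel again.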

  * In comment 3 we proposed a script/approach to auto-test kexecs, used
  here to perform 1000 kexecs with the proposed patch.
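
  One possible shape for such an automated loop is a boot-time step
  that keeps a persistent counter and decides whether another kexec is
  due. The sketch below is hypothetical and is not the script from
  comment 3; the function name, the counter file and the messages are
  all assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical boot-time step for a kexec loop (not the comment 3 script).
# $1: counter file kept on persistent storage, $2: number of kexecs wanted.
# Returns 0 if another kexec should be started, 1 once the limit is reached.
kexec_loop_step() {
    counter_file="$1"
    limit="$2"
    count=0
    # Reuse the count left by the previous boot, if there is one.
    if [ -r "$counter_file" ]; then
        count=$(cat "$counter_file")
    fi
    if [ "$count" -ge "$limit" ]; then
        echo "done after $count kexecs"
        return 1
    fi
    count=$((count + 1))
    echo "$count" > "$counter_file"
    echo "starting kexec $count of $limit"
    return 0
}
```

  A service running at boot could call this step and, when it returns
  0, issue the "kexec -l" and "systemctl kexec" commands shown above; a
  boot that never reaches the step leaves the counter at the number of
  the failing kexec.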

  [Regression Potential]

  * Although the patch proposed here introduces a PCI shutdown handler,
  it keeps the remove handler identical and bases shutdown strongly on
  ena_remove(), changing just the netdev handling, following other
  upstream drivers. It was extensively tested and presented no issues.
  Also, it's self-contained and affects only one driver, so other cloud
  providers or non-cloud environments wouldn't be affected by the patch
  at all.

  * In case of a potential regression, it could manifest as a delay or
  issue in the reboot/shutdown path, and only if the ena driver is in use.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1869948/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
