Subscribing Narinder to map that to HPE if possible.
--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1679208
Title:
Zesty (4.10.0-14) won't boot on HP ProLiant DL360 Gen9 with
intel_iommu=on
Status in linux package in Ubuntu:
Incomplete
Bug description:
TL;DR
- one of our HP ProLiant DL360 Gen9 servers fails to boot with intel_iommu=on
- the disk controller fails
- Xenial seems to work for a while, but then fails
- Zesty crashes 100% of the time on boot
- an identical system seems to work, so a HW replacement is needed to finally confirm
After the failed boot the HW reports the following:
Embedded RAID : Smart HBA H240ar Controller - Operation Failed
- 1719-Slot 0 Drive Array - A controller failure event occurred prior
to this power-up. (Previous lock up code = 0x13)
I tried several things (in between I always redeployed zesty with MAAS).
I think my debugging might be helpful, so I wanted to keep the documentation
in the bug in case you go another route, or in case others find useful
information in here.
0. I retried what I did twice; it is fully reproducible
That is (concrete commands are sketched below):
0.1 install zesty
0.2 change the grub default cmdline in /etc/default/grub.d/50-curtin-settings.cfg
to add intel_iommu=on
0.3 sudo update-grub
0.4 reboot
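For reference, a minimal sketch of steps 0.2-0.4 (the sed line is the same one
used in step 5 below; the grub.cfg path is the Ubuntu default, not verified
output from this system):
$ sudo sed -i \
's/GRUB_CMDLINE_LINUX_DEFAULT=""/GRUB_CMDLINE_LINUX_DEFAULT="intel_iommu=on"/' \
/etc/default/grub.d/50-curtin-settings.cfg
$ sudo update-grub
# verify the generated entries carry the new parameter before rebooting
$ grep intel_iommu /boot/grub/grub.cfg
$ sudo reboot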
1. I tried a Recovery boot from the boot options in grub.
=> Failed as well
2. iLO rebooted via "request reboot" and as well via "full system reset"
=> both Failed
3. Reboot the system as deployed by MAAS
# /proc/cmdline before that
BOOT_IMAGE=/boot/vmlinuz-4.10.0-14-generic
root=UUID=2137c19a-d441-43fa-82e2-f2b7e3b2727b ro
The original grub.cfg is at http://paste.ubuntu.com/24305945/
I rebooted with it as-is.
=> Reboot worked
4. Without changing anything in /etc, run update-grub
$ sudo update-grub
Generating grub configuration file ...
Warning: Setting GRUB_TIMEOUT to a non-zero value when GRUB_HIDDEN_TIMEOUT
is set is no longer supported.
Found linux image: /boot/vmlinuz-4.10.0-14-generic
Found initrd image: /boot/initrd.img-4.10.0-14-generic
Adding boot menu entry for EFI firmware configuration
done
There was no diff between the new grub.cfg and the one I saved.
=> Reboot worked
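For reference, the "no diff" check here amounts to something like this (paths
are the Ubuntu defaults, a small assumption):
# keep a copy of the generated config, regenerate, compare
$ sudo cp /boot/grub/grub.cfg /tmp/grub.cfg.orig
$ sudo update-grub
$ sudo diff -u /tmp/grub.cfg.orig /boot/grub/grub.cfg
# no output => no change; in step 5 only the linux lines gain intel_iommu=on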
5. Add the intel_iommu=on argument
$ sudo sed -i
's/GRUB_CMDLINE_LINUX_DEFAULT=""/GRUB_CMDLINE_LINUX_DEFAULT="intel_iommu=on"/'
/etc/default/grub.d/50-curtin-settings.cfg
$ sudo update-grub
# Diff in grub.cfg really only is the iommu setting
=> Reboot Failed
So this doesn't seem so much like a cloud-init/curtin/maas bug to me anymore
- maybe intel_iommu behaves differently?
- checked grub.cfg pre/post: no change other than the expected one
6. Install Xenial and do the same
=> Reboot working
7. Upgrade to Z
Since the Xenial system just worked, and one can assume that almost only the
kernel matters so early in the boot process, I upgraded the working system
with intel_iommu=on to Zesty.
That would be 4.4.0-71-generic to 4.10.0-14.
On this upgrade I finally saw my I/O errors again :-/
Note: these issues are hard to miss, as they remount root read-only.
I wonder if they only ever appear with intel_iommu=on, as this is the only
combination in which I ever saw them.
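(Roughly how that state shows up - these are generic checks, not output from
this system:)
# is the root filesystem still rw, or remounted ro?
$ findmnt -no OPTIONS /
# the kernel log should show the I/O errors and the forced read-only remount
$ dmesg | grep -iE 'i/o error|read-only'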
8. Redeploy and upgrade to Z without intel_iommu=on enabled
Then enable intel_iommu=on and reboot
=> Reboot Fail
From here I rebooted into the Xenial kernel (which, since this was an upgrade,
was still there).
Here I saw:
Loading Linux 4.4.0-71-generic ...
Loading initial ramdisk ...
error: invalid video mode specification `text'.
Booting in blind mode
Hrm, as outlined above the "blind mode" message might be a red herring, but
since this kernel worked before, something might still be lurking in the
initrd that got regenerated on the upgrade.
=> Xenial Kernel Reboot - works !!
So "blind mode" is a red herring of some sort.
But this might allow finding some logs
=> No
It appears as if the failing boot never made it to the point of actually
writing anything.
In the syslog I see:
1. the original xenial
2. the upgraded zesty
3. NOT the zesty+iommu
4. the xenial+iommu
$ egrep 'kernel:.*(Linux version|Command line)' /var/log/syslog
Apr 3 12:15:20 node-horsea kernel: [ 0.000000] Linux version
4.4.0-71-generic (buildd@lcy01-05) (gcc version 5.4.0 20160609 (Ubuntu
5.4.0-6ubuntu1~16.04.4) ) #92-Ubuntu SMP Fri Mar 24 12:59:01 UTC 2017 (Ubuntu
4.4.0-71.92-generic 4.4.49)
Apr 3 12:15:20 node-horsea kernel: [ 0.000000] Command line:
BOOT_IMAGE=/boot/vmlinuz-4.4.0-71-generic
root=UUID=2137c19a-d441-43fa-82e2-f2b7e3b2727b ro
Apr 3 12:47:45 node-horsea kernel: [ 0.000000] Linux version
4.10.0-14-generic (buildd@lcy01-01) (gcc version 6.3.0 20170221 (Ubuntu
6.3.0-8ubuntu1) ) #16-Ubuntu SMP Fri Mar 17 15:19:26 UTC 2017 (Ubuntu
4.10.0-14.16-generic 4.10.3)
Apr 3 12:47:45 node-horsea kernel: [ 0.000000] Command line:
BOOT_IMAGE=/boot/vmlinuz-4.10.0-14-generic
root=UUID=2137c19a-d441-43fa-82e2-f2b7e3b2727b ro
Apr 3 13:15:49 node-horsea kernel: [ 0.000000] Linux version
4.4.0-71-generic (buildd@lcy01-05) (gcc version 5.4.0 20160609 (Ubuntu
5.4.0-6ubuntu1~16.04.4) ) #92-Ubuntu SMP Fri Mar 24 12:59:01 UTC 2017 (Ubuntu
4.4.0-71.92-generic 4.4.49)
Apr 3 13:15:49 node-horsea kernel: [ 0.000000] Command line:
BOOT_IMAGE=/boot/vmlinuz-4.4.0-71-generic
root=UUID=2137c19a-d441-43fa-82e2-f2b7e3b2727b ro intel_iommu=on
9. Trying to avoid HW replacement if not needed
I was afraid I might need the HW to be replaced to be 100% sure, but this
already very much smells like a SW issue to me.
To avoid an RT ticket for a replacement without real need, I asked to free up
another system.
So I finally could free up an identical machine.
I especially checked the failing HP Smart Array; it has the same Product
Version and FW revision.
There things seem to work, so I might be down to replacing the HW :-/
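(One way to compare Product Version and FW revision from within Linux - a
sketch; the firmware_revision attribute is provided by the hpsa driver and the
host number here is an assumption:)
# PCI identity of the Smart Array controller
$ lspci -nn -s 03:00.0
# firmware revision as reported by the hpsa driver
$ cat /sys/class/scsi_host/host0/firmware_revision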
10. Get some messages of the failure:
With the following grub cmdline I got to see the failure:
GRUB_CMDLINE_LINUX_DEFAULT="intel_iommu=on --- console=ttyS1,115200"
It looks just like the fault I found on the running system when intel_iommu=on
is set on the Xenial kernel, only happening later there (sometimes minutes,
sometimes days, but never without intel_iommu).
But on zesty it seems to trigger 100% on boot, and thereby the system never
even comes up.
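(For completeness, a sketch of how such early console output can be captured
over serial; the GRUB_TERMINAL/GRUB_SERIAL_COMMAND lines are an assumption and
not taken from this system's config:)
# /etc/default/grub.d/50-curtin-settings.cfg (or /etc/default/grub)
GRUB_CMDLINE_LINUX_DEFAULT="intel_iommu=on --- console=ttyS1,115200"
# optionally let grub itself talk to the same serial port
GRUB_TERMINAL=serial
GRUB_SERIAL_COMMAND="serial --unit=1 --speed=115200"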
I'll attach a few logs of the crashes, but the heads of them are:
[ 33.426069] hpsa 0000:03:00.0: Acknowledging event: 0x80000000 (HP SSD
Smart Path configuration change)
[ 618.567636] DMAR: DRHD: handling fault status reg 2
[ 618.567922] DMAR: DMAR:[DMA Read] Request device [03:00.0] fault addr
ffafc000
DMAR:[fault reason 06] PTE Read access is not set
Or
[ 159.779566] hpsa 0000:03:00.0: Command timed out.
[ 159.801113] hpsa 0000:03:00.0: hpsa_send_abort_ioaccel2:
Tag:0x00000000:000000d0: unknown abort service response 0x00
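(For anyone hitting similar DMAR faults: on a booted system the faulting
requester [03:00.0] can be mapped to its PCI identity and IOMMU group roughly
like this - generic commands, not output from this machine:)
# what sits at PCI address 03:00.0?
$ lspci -nn -s 03:00.0
# which IOMMU group does it end up in (only populated with the IOMMU enabled)?
$ find /sys/kernel/iommu_groups/ -type l | grep '0000:03:00.0'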
While it might be a HW issue, I'm filing this anyway so it stays "findable"
for anyone else in case it eventually turns out not to be HW.
But I assign myself for now, to close/confirm once the HW has been replaced.
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1679208/+subscriptions