Would it be possible for you to test the latest upstream kernel? Refer
to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest
v4.11 kernel[0].

If this bug is fixed in the mainline kernel, please add the following
tag: 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag:
'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as
"Confirmed".


Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.11-rc5

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1679208

Title:
  Zesty (4.10.0-14) won't boot on HP ProLiant DL360 Gen9 with
  intel_iommu=on

Status in linux package in Ubuntu:
  Incomplete

Bug description:
  TL;DR
  - one of our HP ProLiant DL360 Gen9 fails to boot with intel_iommu=on
  - the Disk controller fails
  - Xenial seems to work for a while but then fails
  - Zesty 100% crashes on boot
  - An identical system seems to work, so need HW replace to finally confirm

  After the reboot, the HW reports this during boot:
  Embedded RAID : Smart HBA H240ar Controller - Operation Failed
   - 1719-Slot 0 Drive Array  - A controller failure event occurred prior
     to this power-up. (Previous lock up code = 0x13)

  
  I tried several things (always redeploying Zesty with MAAS in between).
  I think my debugging might be helpful, so I want to keep the documentation
  in the bug in case you'd go another route or others find useful
  information in here.

  0. I retried what I did twice; it is fully reproducible.
     That is:
     0.1 install zesty
     0.2 change grub default cmdline in /etc/default/grub.d/50- to add
         intel_iommu=on
     0.3 sudo update-grub
     0.4 reboot


  1. I tried a Recovery boot from the boot options in grub.
     => Failed as well


  2. Rebooted through iLO, via "request reboot" as well as via "full system reset"
     => both Failed


  3. Reboot the system as deployed by MAAS
     # /proc/cmdline before that
     BOOT_IMAGE=/boot/vmlinuz-4.10.0-14-generic root=UUID=2137c19a-d441-43fa-82e2-f2b7e3b2727b ro
     The orig grub.cfg is like http://paste.ubuntu.com/24305945/
     It reboots as-is.
     => Reboot worked


  4. Without changing anything in /etc, run update-grub
     $ sudo update-grub
     Generating grub configuration file ...
     Warning: Setting GRUB_TIMEOUT to a non-zero value when GRUB_HIDDEN_TIMEOUT is set is no longer supported.
     Found linux image: /boot/vmlinuz-4.10.0-14-generic
     Found initrd image: /boot/initrd.img-4.10.0-14-generic
     Adding boot menu entry for EFI firmware configuration
     done

     There was no diff between the new grub.cfg and the one I saved.
     => Reboot worked


  5. Add the intel_iommu=on arg
    $ sudo sed -i \
        's/GRUB_CMDLINE_LINUX_DEFAULT=""/GRUB_CMDLINE_LINUX_DEFAULT="intel_iommu=on"/' \
        /etc/default/grub.d/50-curtin-settings.cfg
    $ sudo update-grub
    # The only diff in grub.cfg is the iommu setting
    => Reboot Failed
    So this no longer looks like a cloud-init/curtin/maas bug to me;
    maybe intel_iommu behaves differently?
    - Check of grub.cfg pre/post: no change but the expected one.
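
    The sed call in step 5 only matches while GRUB_CMDLINE_LINUX_DEFAULT is
    still empty. A minimal sketch of a more forgiving variant follows; the
    function name add_kernel_param is hypothetical and not part of any
    Ubuntu tool, and it assumes the variable sits on a single double-quoted
    line and that the parameter contains no characters special to sed or
    grep.

```shell
#!/bin/sh
# add_kernel_param FILE PARAM   (hypothetical helper, illustration only)
# Adds PARAM to GRUB_CMDLINE_LINUX_DEFAULT in FILE unless it is already set.
# Assumes a single-line, double-quoted assignment and a PARAM without
# characters that are special in sed/grep patterns (such as "/" or ".").
add_kernel_param() {
    _file=$1
    _param=$2

    # Already present as a whole word inside the quotes: nothing to do.
    if grep -Eq "^GRUB_CMDLINE_LINUX_DEFAULT=\"([^\"]* )?${_param}( [^\"]*)?\"" "$_file"; then
        return 0
    fi

    _out=$(mktemp) || return 1
    # First expression appends to a non-empty value, second fills an empty one.
    if sed \
        -e "s/^\(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"][^\"]*\)\"/\1 ${_param}\"/" \
        -e "s/^GRUB_CMDLINE_LINUX_DEFAULT=\"\"/GRUB_CMDLINE_LINUX_DEFAULT=\"${_param}\"/" \
        "$_file" > "$_out"
    then
        # Write back through the existing file so its permissions are kept.
        cat "$_out" > "$_file"
    fi
    rm -f "$_out"
}
```

    It would be tried on a copy of 50-curtin-settings.cfg first, e.g.
    add_kernel_param ./50-curtin-settings.cfg intel_iommu=on, followed by
    sudo update-grub once the result looks right.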


  6. Install Xenial and do the same
     => Reboot working


  7. Upgrade to Z
     Since the Xenial system just worked, and one can assume that almost
     only the kernel is active so early in the boot process, I upgraded the
     working system with intel_iommu=on to Zesty.
     That would be 4.4.0-71-generic to 4.10.0-1
     On this upgrade I finally saw my I/O errors again :-/
     Note: these issues are hard to miss, as root ends up mounted read-only.
     I wonder if they only ever appear with intel_iommu=on, as this is the
     only combo in which I ever saw them.

  8. Redeploy and upgrade to Z without intel_iommu=on enabled
     Then enable intel_iommu=on and reboot
     => Reboot Fail
     From here I rebooted into the Xenial kernel (which was still there,
     since this is an upgrade).
     Here I saw:
      Loading Linux 4.4.0-71-generic ...
      Loading initial ramdisk ...
      error: invalid video mode specification `text'.
      Booting in blind mode
     Hrm, as outlined above the "blind mode" might be a red herring, but
     since this kernel worked before it might still be a red herring that
     swims in the initrd that got regenerated on the upgrade.
     => Xenial Kernel Reboot - works !!
     So "blind mode" is a red herring of some sort.

     But this might allow me to find some logs
     => No
     It appears as if the failing boot never made it to the point of
     actually writing anything.
     I see:
      1. the original xenial
      2. the upgraded zesty
      3. NOT THE zesty+iommu
      4. the xenial+iommu

  $ egrep 'kernel:.*(Linux version|Command line)' /var/log/syslog
  Apr  3 12:15:20 node-horsea kernel: [    0.000000] Linux version 4.4.0-71-generic (buildd@lcy01-05) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4) ) #92-Ubuntu SMP Fri Mar 24 12:59:01 UTC 2017 (Ubuntu 4.4.0-71.92-generic 4.4.49)
  Apr  3 12:15:20 node-horsea kernel: [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.4.0-71-generic root=UUID=2137c19a-d441-43fa-82e2-f2b7e3b2727b ro
  Apr  3 12:47:45 node-horsea kernel: [    0.000000] Linux version 4.10.0-14-generic (buildd@lcy01-01) (gcc version 6.3.0 20170221 (Ubuntu 6.3.0-8ubuntu1) ) #16-Ubuntu SMP Fri Mar 17 15:19:26 UTC 2017 (Ubuntu 4.10.0-14.16-generic 4.10.3)
  Apr  3 12:47:45 node-horsea kernel: [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.10.0-14-generic root=UUID=2137c19a-d441-43fa-82e2-f2b7e3b2727b ro
  Apr  3 13:15:49 node-horsea kernel: [    0.000000] Linux version 4.4.0-71-generic (buildd@lcy01-05) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4) ) #92-Ubuntu SMP Fri Mar 24 12:59:01 UTC 2017 (Ubuntu 4.4.0-71.92-generic 4.4.49)
  Apr  3 13:15:49 node-horsea kernel: [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.4.0-71-generic root=UUID=2137c19a-d441-43fa-82e2-f2b7e3b2727b ro intel_iommu=on
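
  To make the "which boots were actually logged" check quicker to read, the
  syslog lines can be condensed to one line per boot. This is only a sketch:
  list_boots is a hypothetical helper name, and it assumes each syslog entry
  is on a single line (unwrapped) and that every "Command line:" entry
  follows its "Linux version" entry.

```shell
#!/bin/sh
# list_boots FILE   (hypothetical helper, illustration only)
# Prints "<kernel release> <intel_iommu=on|no-iommu>" for every boot in FILE.
list_boots() {
    awk '
        /kernel:.*Linux version/ {
            # The kernel release is the field right after "Linux version".
            for (i = 2; i <= NF; i++)
                if ($i == "version" && $(i-1) == "Linux") { rel = $(i+1); break }
        }
        /kernel:.*Command line:/ {
            # Match the parameter only as a whole word on the command line.
            iommu = ($0 ~ /(^| )intel_iommu=on( |$)/) ? "intel_iommu=on" : "no-iommu"
            print rel, iommu
        }
    ' "$1"
}
```

  Called as list_boots /var/log/syslog, the three logged boots above would
  come out as "4.4.0-71-generic no-iommu", "4.10.0-14-generic no-iommu" and
  "4.4.0-71-generic intel_iommu=on"; the missing "4.10.0-14-generic
  intel_iommu=on" line is the boot that never got far enough to log.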


  9. Trying to avoid HW replacement if not needed
  I was afraid I might need the HW to be replaced to be 100% sure, but this
  already smells very much like broken SW to me.
  To avoid an RT ticket for a replacement without real need, I asked for
  another system to be freed up.

  So I finally could free up an identical machine.
  I specifically checked the failing HP Smart Array: it has the same Product
  Version and FW revision.

  There things seem to work, so I might be down to replacing the HW :-/


  10. Get some messages of the failure:
  With the following grub cmdline I got to see the failure:
  GRUB_CMDLINE_LINUX_DEFAULT="intel_iommu=on --- console=ttyS1,115200"

  It looks just like the one I found on the running system when
  intel_iommu=on is set on the Xenial kernel, where it happens later
  (sometimes minutes, sometimes days, but never without intel_iommu).
  But on Zesty it seems to trigger 100% on boot, so the system does not
  even come up.

  I'll attach a few logs of the crashes, but the key lines are
  [   33.426069] hpsa 0000:03:00.0: Acknowledging event: 0x80000000 (HP SSD Smart Path configuration change)
  [  618.567636] DMAR: DRHD: handling fault status reg 2
  [  618.567922] DMAR: DMAR:[DMA Read] Request device [03:00.0] fault addr ffafc000
                 DMAR:[fault reason 06] PTE Read access is not set

  Or
  [  159.779566] hpsa 0000:03:00.0: Command timed out.
  [  159.801113] hpsa 0000:03:00.0: hpsa_send_abort_ioaccel2: Tag:0x00000000:000000d0: unknown abort service response 0x00
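
  When comparing several captured console logs, it may help to pull just
  the faulting device, access type and address out of the DMAR lines. A
  rough sketch, assuming the fault lines look exactly like the one quoted
  above; dmar_faults is a hypothetical helper name, and other kernels may
  format these messages differently.

```shell
#!/bin/sh
# dmar_faults FILE   (hypothetical helper, illustration only)
# Prints "<device> <access type> <fault address>" for each DMAR fault line
# of the form: DMAR:[DMA Read] Request device [03:00.0] fault addr ffafc000
dmar_faults() {
    sed -n \
        's/.*DMAR:\[\(DMA [A-Za-z]*\)\] Request device \[\([0-9a-f:.]*\)\] fault addr \([0-9a-f]*\).*/\2 \1 \3/p' \
        "$1"
}
```

  For the fault quoted above this yields "03:00.0 DMA Read ffafc000", i.e.
  the same PCI address (0000:03:00.0) that the hpsa messages refer to.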


  
  While it might be a HW issue, I am still filing this so that it is
  "findable" for anyone else in case it turns out not to be HW after all.
  For now I am assigning it to myself, to close/confirm once the HW has been
  replaced.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1679208/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
