This bug is missing log files that will aid in diagnosing the problem.
From a terminal window please run:

apport-collect 1679208

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable
to run this command, please add a comment stating that fact and change
the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the
Ubuntu Kernel Team.

** Changed in: linux (Ubuntu)
       Status: New => Incomplete

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1679208

Title:
  Zesty (4.10.0-14) won't boot on HP ProLiant DL360 Gen9 with
  intel_iommu=on

Status in linux package in Ubuntu:
  Incomplete

Bug description:
  TL;DR
  - one of our HP ProLiant DL360 Gen9 fails to boot with intel_iommu=on
  - the Disk controller fails
  - Xenial seems to work for a while but then fails
  - Zesty 100% crashes on boot
  - An identical system seems to work, so a HW replacement is needed to finally confirm

  After the reboot, the HW reports this during boot:
  Embedded RAID : Smart HBA H240ar Controller - Operation Failed
   - 1719-Slot 0 Drive Array  - A controller failure event occurred prior
     to this power-up. (Previous lock up code = 0x13)

  
  I tried several things (always redeploying zesty with MAAS in between).
  I think my debugging might be helpful, but I wanted to keep the
  documentation in the bug in case you'd go another route or others find
  useful information in here.

  0. I retried what I did twice, fully reproducible
     That is:
     0.1 install zesty
     0.2 change grub default cmdline in /etc/default/grub.d/50- to add
         intel_iommu=on
     0.3 sudo update-grub
     0.4 reboot


  1. I tried a Recovery boot from the boot options in grub.
     => Failed as well


  2. Rebooted via iLO, using "request reboot" as well as "full system reset"
     => both Failed


  3. Reboot the system as deployed by MAAS
     # /proc/cmdline before that
     BOOT_IMAGE=/boot/vmlinuz-4.10.0-14-generic root=UUID=2137c19a-d441-43fa-82e2-f2b7e3b2727b ro
     The orig grub.cfg is like http://paste.ubuntu.com/24305945/
     It reboots as-is.
     => Reboot worked


  4. without a change to anything in /etc run update-grub
     $ sudo update-grub
     Generating grub configuration file ...
     Warning: Setting GRUB_TIMEOUT to a non-zero value when GRUB_HIDDEN_TIMEOUT is set is no longer supported.
     Found linux image: /boot/vmlinuz-4.10.0-14-generic
     Found initrd image: /boot/initrd.img-4.10.0-14-generic
     Adding boot menu entry for EFI firmware configuration
     done

     There was no diff between the new grub.cfg and the one I saved.
     => Reboot worked


  5. add the intel_iommu=on arg
    $ sudo sed -i \
        's/GRUB_CMDLINE_LINUX_DEFAULT=""/GRUB_CMDLINE_LINUX_DEFAULT="intel_iommu=on"/' \
        /etc/default/grub.d/50-curtin-settings.cfg
    $ sudo update-grub
    # Diff in grub.cfg really only is the iommu setting
    => Reboot Failed
    So this doesn't look much like a cloud-init/curtin/maas bug to me anymore
    - maybe intel_iommu behaves differently?
    - Checked grub cfg pre/post - no change other than the expected one
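
  The edit from step 5 can be made repeatable, so that a redeploy does not
  end up with a doubled argument. This is only a minimal sketch, assuming
  GNU sed and a kernel argument that contains no '/' or regex characters;
  add_kernel_arg is a hypothetical helper name, not something grub or curtin
  provides.

```shell
# Hypothetical helper: append a kernel argument to GRUB_CMDLINE_LINUX_DEFAULT
# in the given file, unless the argument is already present.
add_kernel_arg() {
    cfg=$1
    arg=$2

    # Already on the default cmdline: leave the file untouched.
    if grep -q "^GRUB_CMDLINE_LINUX_DEFAULT=.*[\" ]${arg}[\" ]" "$cfg"; then
        return 0
    fi

    # First rule: non-empty value, append after the existing arguments.
    # Second rule: empty value (""), set it to just the new argument.
    # The order matters, so that the second rule's result is not matched
    # again by the first one.
    sed -i \
        -e "s/^\(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*[^\" ]\)\"/\1 ${arg}\"/" \
        -e "s/^GRUB_CMDLINE_LINUX_DEFAULT=\"\"/GRUB_CMDLINE_LINUX_DEFAULT=\"${arg}\"/" \
        "$cfg"
}

# Example use, mirroring step 5 (needs root for files under /etc):
#   add_kernel_arg /etc/default/grub.d/50-curtin-settings.cfg intel_iommu=on
#   update-grub
```

  As in step 5, the change only takes effect after update-grub and a reboot.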


  6. Install Xenial and do the same
     => Reboot working


  7. Upgrade to Z
     Since the Xenial system just worked, and one can assume that almost only
     the kernel is active this early in the boot process, I upgraded the
     working system with intel_iommu=on to Zesty.
     That would be 4.4.0-71-generic to 4.10.0-1
     On this upgrade I finally saw my I/O errors again :-/
     Note: these issues are hard to miss as they cause root to be mounted
     read-only.
     I wonder if they only ever appear with intel_iommu=on, as this is the
     only combination in which I ever saw them.


  8. Redeploy and upgrade to Z without intel_iommu=on enabled
     Then enable intel_iommu=on and reboot
     => Reboot Fail
     From here I rebooted into the Xenial kernel (which was still there, since
     this was an upgrade)
     Here I saw:
      Loading Linux 4.4.0-71-generic ...
      Loading initial ramdisk ...
      error: invalid video mode specification `text'.
      Booting in blind mode
     Hrm, the "blind mode" might be a red herring, but since this kernel
     worked before, it might still be a red herring that swims in the initrd
     that got regenerated on the upgrade.
     => Xenial Kernel Reboot - works !!
     So "blind mode" is a red herring of some sort.
     
     But this might allow me to find some logs
     => No
     It appears the failing boot never got far enough to actually write
     anything.
     I see:
      1. the original xenial
      2. the upgraded zesty
      3. NOT THE zesty+iommu
      4. the xenial+iommu

  $ egrep 'kernel:.*(Linux version|Command line)' /var/log/syslog 
  Apr  3 12:15:20 node-horsea kernel: [    0.000000] Linux version 4.4.0-71-generic (buildd@lcy01-05) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4) ) #92-Ubuntu SMP Fri Mar 24 12:59:01 UTC 2017 (Ubuntu 4.4.0-71.92-generic 4.4.49)
  Apr  3 12:15:20 node-horsea kernel: [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.4.0-71-generic root=UUID=2137c19a-d441-43fa-82e2-f2b7e3b2727b ro
  Apr  3 12:47:45 node-horsea kernel: [    0.000000] Linux version 4.10.0-14-generic (buildd@lcy01-01) (gcc version 6.3.0 20170221 (Ubuntu 6.3.0-8ubuntu1) ) #16-Ubuntu SMP Fri Mar 17 15:19:26 UTC 2017 (Ubuntu 4.10.0-14.16-generic 4.10.3)
  Apr  3 12:47:45 node-horsea kernel: [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.10.0-14-generic root=UUID=2137c19a-d441-43fa-82e2-f2b7e3b2727b ro
  Apr  3 13:15:49 node-horsea kernel: [    0.000000] Linux version 4.4.0-71-generic (buildd@lcy01-05) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4) ) #92-Ubuntu SMP Fri Mar 24 12:59:01 UTC 2017 (Ubuntu 4.4.0-71.92-generic 4.4.49)
  Apr  3 13:15:49 node-horsea kernel: [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.4.0-71-generic root=UUID=2137c19a-d441-43fa-82e2-f2b7e3b2727b ro intel_iommu=on
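
  To see at a glance which boots made it into the log, the syslog lines can
  be condensed to one line per boot. This is a rough sketch that assumes the
  syslog format shown in this report, with each record on a single line;
  list_boots is a hypothetical helper name.

```shell
# Hypothetical helper: print one line per logged boot with its timestamp,
# kernel version and whether intel_iommu=on was on the command line.
list_boots() {
    awk '
        / kernel: .*Linux version / {
            # The kernel release is the word right after the first "version".
            for (i = 1; i <= NF; i++)
                if ($i == "version") { ver = $(i + 1); break }
        }
        / kernel: .*Command line: / {
            iommu = ($0 ~ /(^| )intel_iommu=on( |$)/) ? "yes" : "no"
            print $1, $2, $3, ver, "intel_iommu=" iommu
        }
    ' "$1"
}

# Example use:
#   list_boots /var/log/syslog
```

  A boot that never reached the point of writing to disk, like the failing
  zesty+iommu one, simply does not show up in this list.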


  9. Trying to avoid HW replacement if not needed
  I was afraid I might need the HW to be replaced to be 100% sure, but this
  very much smells broken in SW to me already.
  To avoid an RT ticket for a replacement without real need, I asked to free
  up another system.

  So I finally could free up an identical machine.
  I especially checked the failing HP smart array; it has the same Product
  Version and FW revision.

  There things seem to work, so I might be down to replacing the HW :-/


  10. get some messages of the fail:
  With the following grub cmdline I got to see the fail:
  GRUB_CMDLINE_LINUX_DEFAULT="intel_iommu=on --- console=ttyS1,115200"

  It looks just like the one I found on the running system when
  intel_iommu=on is set on the Xenial kernel, where it happens later
  (sometimes minutes, sometimes days, but never without intel_iommu).
  But on zesty it seems to trigger 100% on boot, so the system does not even
  come up.

  I'll attach a few logs of the crashes, but the heads are
  [   33.426069] hpsa 0000:03:00.0: Acknowledging event: 0x80000000 (HP SSD Smart Path configuration change)
  [  618.567636] DMAR: DRHD: handling fault status reg 2
  [  618.567922] DMAR: DMAR:[DMA Read] Request device [03:00.0] fault addr ffafc000
                 DMAR:[fault reason 06] PTE Read access is not set

  Or
  [  159.779566] hpsa 0000:03:00.0: Command timed out.
  [  159.801113] hpsa 0000:03:00.0: hpsa_send_abort_ioaccel2: Tag:0x00000000:000000d0: unknown abort service response 0x00
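
  For comparing several captured console logs, the two signatures above can
  be pulled out mechanically. This is just a sketch that assumes the message
  formats seen in these excerpts (other kernel versions may word them
  differently) and a sed that supports -E; dmar_faults and hpsa_timeouts are
  hypothetical helper names.

```shell
# Hypothetical helper: print "device direction address" for every DMAR
# fault line in the given log, e.g. for the excerpt above:
#   03:00.0 Read ffafc000
dmar_faults() {
    sed -nE 's/.*\[DMA (Read|Write)\] Request device \[([0-9a-f:.]+)\] fault addr ([0-9a-f]+).*/\2 \1 \3/p' "$1"
}

# Hypothetical helper: count hpsa command timeouts in the given log.
hpsa_timeouts() {
    grep -c 'hpsa .*Command timed out' "$1"
}

# Example use on a saved serial console capture (file name is made up):
#   dmar_faults console.log | sort | uniq -c
#   hpsa_timeouts console.log
```

  If every fault names the same device (03:00.0, the hpsa controller, in the
  excerpts above), that points at this one controller's DMA rather than at
  the IOMMU setup in general.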


  
  While it might be a HW issue, I am still filing this so that it is
  "findable" for anyone else, in case it turns out not to be HW after all.
  But I assign myself for now to close/confirm once I have replaced the HW.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1679208/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
