** Changed in: linux (Ubuntu)
     Assignee: Canonical Kernel Team (canonical-kernel-team) => Chris J Arges (arges)

** Changed in: linux (Ubuntu)
   Importance: Undecided => Medium

** Changed in: linux (Ubuntu)
       Status: Triaged => In Progress

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1512593

Title:
  :Ubuntu NV: Panic timeout=0 means Ubuntu does not reboot and recover
  from HMI (e.g. Core Unit Checkstop)

Status in linux package in Ubuntu:
  In Progress

Bug description:
  Problem Description
  ==================================
  I attempted to inject a Core Unit Checkstop error by flipping Core FIR
  bit 5 on a K80 (Nvidia) 42L Server (FSP - gp4fp1.aus.stglabs.ibm.com).
  I was expecting the server to crash due to a Sapphire assert (PEL with
  SRC BB821410) and a Sapphire dump to be collected in the process.

  However, on injecting the error, the host (Ubuntu NV) crashed and
  never recovered. OPAL, however, seemed to stay up, and the FSP logged
  a plethora of mail box errors (B182953C).

  Error Inject
  --------------
  $ putscom pu.ex 10013100 5 1 1 -ib -p3 -c5
  s1.ex   k0:n0:s0:p03:c5
  ecmd_ppc putscom pu.ex 10013100 5 1 1 -ib -p3 -c5

  
  Console message
  --------------
  [htx@gp4p01] /sys/devices/system/cpu# [ 1652.976253] Fatal Hypervisor Maintenance interrupt [Not recovered]
  [ 1652.976332]  Error detail: Malfunction Alert
  [ 1652.976402]  HMER: 8040000000000000
  [ 1652.976450] Kernel panic - not syncing: Unrecoverable HMI exception
  [ 1652.976467] CPU: 24 PID: 1261 Comm: kworker/24:1 Tainted: P           OE 3.16.0-37-generic #51~14.04.1-Ubuntu
  [ 1652.976530] Workqueue: events hmi_event_handler
  [ 1652.976561] Call Trace:
  [ 1652.976571] [c0000000189bf9e0] [c000000000017330] show_stack+0x170/0x290 (unreliable)
  [ 1652.976647] [c0000000189bfac0] [c0000000009eb8e4] dump_stack+0x90/0xbc
  [ 1652.976674] [c0000000189bfaf0] [c0000000009e2b5c] panic+0x104/0x2a8
  [ 1652.976703] [c0000000189bfb80] [c00000000007306c] hmi_event_handler+0x19c/0x2b0
  [ 1652.976732] [c0000000189bfc50] [c0000000000d62dc] process_one_work+0x1ac/0x4d0
  [ 1652.976772] [c0000000189bfce0] [c0000000000d6b80] worker_thread+0x190/0x630
  [ 1652.976800] [c0000000189bfd80] [c0000000000e0024] kthread+0x114/0x140
  [ 1652.976837] [c0000000189bfe30] [c00000000000a468] ret_from_kernel_thread+0x5c/0x74
  [ 1652.977085] ---[ end Kernel panic - not syncing: Unrecoverable HMI exception
   

  .
  .
  .
  .
  
  |------------------------------------------------------------------------------|
  | 0x50227CFC 06/14/2015 22:27:43 System Hypervisor Firmware               mbox |
  | 0x50227CFC Processed           Predictive Error                     B182953C | --> Unexpected mail box error, needs investigation
  |------------------------------------------------------------------------------|
  | 0x50227CDD 06/14/2015 22:27:21 System Hypervisor Firmware               spif |
  | 0x50227CDD Processed           Predictive Error                     B182951C | --> Unexpected mail box error, needs investigation
  |------------------------------------------------------------------------------|
  | 0x50227CC0 06/14/2015 22:26:57 System Hypervisor Firmware               mbox |
  | 0x50227CC0 Processed           Predictive Error                     B182953C | --> Unexpected mail box error, needs investigation
  |------------------------------------------------------------------------------|
  | 0x50227C9E 06/14/2015 22:26:31 System Hypervisor Firmware               spif |
  | 0x50227C9E Processed           Predictive Error                     B182951C | --> Unexpected mail box error, needs investigation
  |------------------------------------------------------------------------------|
  | 0x50227C88 06/14/2015 22:26:12 System Hypervisor Firmware               mbox |
  | 0x50227C88 Processed           Predictive Error                     B182953C | --> Unexpected mail box error, needs investigation
  |------------------------------------------------------------------------------|
  | 0x50227C7E 06/14/2015 22:26:06 System Hypervisor Firmware               spif |
  | 0x50227C7E Processed           Predictive Error                     B182951C | --> Unexpected mail box error, needs investigation
  |------------------------------------------------------------------------------|
  | 0x50227C4F 06/14/2015 22:25:27 System Hypervisor Firmware               mbox |
  | 0x50227C4F Processed           Predictive Error                     B182953C | --> Unexpected mail box error, needs investigation
  |------------------------------------------------------------------------------|
  | 0x50227C40 06/14/2015 22:25:16 System Hypervisor Firmware               spif |
  | 0x50227C40 Processed           Predictive Error                     B182951C | --> Unexpected mail box error, needs investigation
  |------------------------------------------------------------------------------|
  | 0x50227C17 06/14/2015 22:24:42 System Hypervisor Firmware               mbox |
  | 0x50227C17 Processed           Predictive Error                     B182953C | --> Unexpected mail box error, needs investigation
  |------------------------------------------------------------------------------|
  | 0x50227C01 06/14/2015 22:24:25 System Hypervisor Firmware               spif |
  | 0x50227C01 Processed           Predictive Error                     B182951C | --> Unexpected mail box error, needs investigation
  |------------------------------------------------------------------------------|
  | 0x501CF14F 06/11/2015 15:32:31 Processor Unit (CPU)                     prdf |
  | 0x501CF14F Processed           Predictive Error                     B113E504 | --> Error log corresponding to injected Core Unit Checkstop error (Core FIR [5])
  |------------------------------------------------------------------------------|
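
  When triaging a log like the one above, it can help to count how often
  each SRC appears. The following is only a rough sketch: the function
  name tally_srcs and the regular expression are illustrative
  assumptions (an SRC is taken to be "B" plus seven hex digits directly
  before the closing "|"), not part of any FSP tooling.

```python
import re
from collections import Counter

# Assumption: an SRC looks like 'B' followed by seven hex digits and sits
# immediately before the closing '|' of a log row.
SRC_PATTERN = re.compile(r"\b(B[0-9A-F]{7})\s*\|")


def tally_srcs(log_text: str) -> Counter:
    """Count occurrences of each SRC found in the pasted error-log text."""
    return Counter(SRC_PATTERN.findall(log_text))


# Three rows copied from the log above, used here only as sample input.
SAMPLE = """\
| 0x50227CFC Processed           Predictive Error                     B182953C |
| 0x50227CDD Processed           Predictive Error                     B182951C |
| 0x501CF14F Processed           Predictive Error                     B113E504 |
"""

if __name__ == "__main__":
    for src, count in tally_srcs(SAMPLE).most_common():
        print(src, count)
```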

  == Comment: #3 - MAHESH J. SALGAONKAR <mahesh.salgaon...@in.ibm.com> - 2015-06-16 14:06:56 ==
  Ah! I see that the panic timeout is set to 0 (zero), which means the
  kernel will wait forever after a panic. The behaviour reported in this
  bug is expected when the panic timeout is set to 0. Hence it is not a
  bug.

  ===================
  root@gp4p01:/usr/lib/debug/boot# cat /proc/sys/kernel/panic
  0
  root@gp4p01:/usr/lib/debug/boot# echo 10 > /proc/sys/kernel/panic
  root@gp4p01:/usr/lib/debug/boot# cat /proc/sys/kernel/panic
  10
  root@gp4p01:/usr/lib/debug/boot#

  After setting panic_timeout to 10 seconds, I see that the system
  rebooted on an unrecoverable HMI:

  [salgaonkarm@mars linux-2.6]$ fsp_cmd -i gp4fp1.aus.stglabs.ibm.com
  Checking if system 'gp4fp1.aus.stglabs.ibm.com' is accesible (ping test)
  spawn telnet gp4fp1.aus.stglabs.ibm.com
  Trying 9.3.136.91...
  Connected to gp4fp1.aus.stglabs.ibm.com.
  Escape character is '^]'.

  Linux 2.6.32-279.14.1.69.fsp_fld8_1.ppcnf-fsp2 (gp4fp1) (12:51 on
  Friday, 08 October 2004)

  gp4fp1 login: dev
  Password: 
  $ smgr mfgState
  runtime
  $ 
  $ putscom pu.ex 10013100 5 1 1 -ib -p3 -c5
  s1.ex k0:n0:s0:p03:c5    
  ecmd_ppc putscom pu.ex 10013100 5 1 1 -ib -p3 -c5 
  $ smgr mfgState
  runtime
  $ smgr mfgState
  ipling                     <= System rebooting.. 
  ===================

  
  For the system to reboot after a panic, please set the panic timeout
  to a non-zero value and try injecting the core checkstop again. You
  can do that in two ways:
  1. Boot the kernel with the "panic=<secs>" kernel option (see below for valid values)
      OR
  2. Once the OS is booted, echo a non-zero value to /proc/sys/kernel/panic
      $ echo 10 > /proc/sys/kernel/panic
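
  Before injecting an error, the current setting can also be checked
  from a script. This is a minimal sketch, assuming a Linux host that
  exposes /proc/sys/kernel/panic; the helper names parse_panic_timeout
  and read_panic_timeout are made up for illustration.

```python
from pathlib import Path

# Same file that the 'cat' and 'echo' commands in this report use.
PANIC_SYSCTL = Path("/proc/sys/kernel/panic")


def parse_panic_timeout(text: str) -> int:
    """Convert the raw file contents (e.g. '10\\n') to an integer."""
    return int(text.strip())


def read_panic_timeout(path: Path = PANIC_SYSCTL):
    """Return the panic timeout in seconds, or None if it cannot be read."""
    try:
        return parse_panic_timeout(path.read_text())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    timeout = read_panic_timeout()
    if timeout is None:
        print("panic timeout not readable on this system")
    elif timeout == 0:
        print("panic timeout is 0: the kernel will not reboot after a panic")
    else:
        print(f"panic timeout is {timeout}")
```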


  Please refer to Documentation/kernel-parameters.txt for valid panic
  values:

  
-----------------Documentation/kernel-parameters.txt----------------------------
          panic=          [KNL] Kernel behaviour on panic: delay <timeout>
                          timeout > 0: seconds before rebooting
                          timeout = 0: wait forever
                          timeout < 0: reboot immediately
                          Format: <timeout>
  
-----------------Documentation/kernel-parameters.txt----------------------------
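
  The three cases in the excerpt above can be written as a small lookup.
  This is a sketch only; describe_panic_timeout is a hypothetical helper
  name and the returned strings simply paraphrase the documentation
  text.

```python
def describe_panic_timeout(timeout: int) -> str:
    """Map a panic= value to the behaviour listed in kernel-parameters.txt."""
    if timeout > 0:
        # timeout > 0: seconds before rebooting
        return f"reboot after {timeout} seconds"
    if timeout == 0:
        # timeout = 0: wait forever (the case reported in this bug)
        return "wait forever"
    # timeout < 0: reboot immediately
    return "reboot immediately"


if __name__ == "__main__":
    for value in (10, 0, -1):
        print(value, "->", describe_panic_timeout(value))
```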

  == Comment: #8 - Stewart Smith <sesm...@au1.ibm.com> - 2015-09-17 20:03:51 ==
  You should be able to provide panic_timeout=180 as a kernel argument
  as a workaround.
  However, this is *completely* an Ubuntu bug. Perhaps we want to modify
  the panic() handler in linux though.

  == Comment: #9 - Luciano Chavez <cha...@us.ibm.com> - 2015-09-18 10:47:31 ==
  I was speaking to Feroz and Donna this morning (and sent a note about
  the same thing last night), and to them the issue is not the panic
  itself but that the system does not reboot after hitting it. As
  explained, there are three methods to override the panic timeout (add
  kernel.panic= to sysctl.conf, echo a value to /proc/sys/kernel/panic,
  or pass panic= on the kernel command line).
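
  For reference, the three override methods mentioned here can be
  spelled out for a given value. This is an illustrative sketch:
  panic_timeout_overrides is a hypothetical helper, and it only builds
  the text of each setting without applying anything.

```python
def panic_timeout_overrides(seconds: int) -> dict:
    """Return the text of each override method for the given timeout."""
    return {
        # Line to add to sysctl.conf (persistent across reboots).
        "sysctl.conf": f"kernel.panic = {seconds}",
        # Command to run as root on a booted system (lost on reboot).
        "runtime": f"echo {seconds} > /proc/sys/kernel/panic",
        # Option to append to the kernel command line.
        "cmdline": f"panic={seconds}",
    }


if __name__ == "__main__":
    for method, text in panic_timeout_overrides(10).items():
        print(f"{method}: {text}")
```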

  So, I hope Feroz can chime in, but they want the hard-coded
  CONFIG_PANIC_TIMEOUT changed, the rationale being that customers would
  likely not be aware that they need to change it, despite the
  previously mentioned ways to do it.

  So, if that is what they want, we will have to send this to Canonical
  for their take on this.

  
  Hi Canonical,
  Can you please comment with your take on this issue?

  Thank you.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1512593/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
