Looking at the log, I noticed that EEH reports a frozen PE right after the
Broadcom card is found. Is that the tg3 device?

[  OK  ] Found device NetXtreme BCM5719 Gigabit Ethernet PCIe.
[    8.191135] EEH: Frozen PE#7 on PHB#21 detected
[    8.191280] EEH: PE location: S00210f, PHB location: N/A

Also, the recovery problem seems to be caused by ast.

[   18.267005] EEH: 2100000 reads ignored for recovering device at location=S00210f driver=ast pci addr=0021:10:00.0
[   18.267334] EEH: Might be infinite loop in ast driver

Looking at the upstream logs, one commit came up. Can you open a new bug
for it?

commit 298360af3dab45659810fdc51aba0c9f4097e4f6
Author: Russell Currey <rus...@russell.cc>
Date:   Thu Dec 15 16:12:41 2016 +1100

    drivers/gpu/drm/ast: Fix infinite loop if read fails
    
    ast_get_dram_info() configures a window in order to access BMC memory.
    A BMC register can be configured to disallow this, and if so, causes
    an infinite loop in the ast driver which renders the system unusable.
    
    Fix this by erroring out if an error is detected.  On powerpc systems with
    EEH, this leads to the device being fenced and the system continuing to
    operate.
    
    Cc: <sta...@vger.kernel.org> # 3.10+
    Signed-off-by: Russell Currey <rus...@russell.cc>
    Reviewed-by: Joel Stanley <j...@jms.id.au>
    Signed-off-by: Daniel Vetter <daniel.vet...@ffwll.ch>
    Link: http://patchwork.freedesktop.org/patch/msgid/20161215051241.20815-1-rus...@russell.cc

diff --git a/drivers/gpu/drm/ast/ast_main.c b/drivers/gpu/drm/ast/ast_main.c
index 904beaa932d03..f75c6421db623 100644
--- a/drivers/gpu/drm/ast/ast_main.c
+++ b/drivers/gpu/drm/ast/ast_main.c
@@ -223,7 +223,8 @@ static int ast_get_dram_info(struct drm_device *dev)
        ast_write32(ast, 0x10000, 0xfc600309);
 
        do {
-               ;
+               if (pci_channel_offline(dev->pdev))
+                       return -EIO;
        } while (ast_read32(ast, 0x10000) != 0x01);
        data = ast_read32(ast, 0x10004);
 
@@ -428,7 +429,9 @@ int ast_driver_load(struct drm_device *dev, unsigned long flags)
        ast_detect_chip(dev, &need_post);
 
        if (ast->chip != AST1180) {
-               ast_get_dram_info(dev);
+               ret = ast_get_dram_info(dev);

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to makedumpfile in Ubuntu.
https://bugs.launchpad.net/bugs/1681909

Title:
  dump is not captured in remote host when kdump over ssh is configured
  on firestone.

Status in The Ubuntu-power-systems project:
  Incomplete
Status in makedumpfile package in Ubuntu:
  New

Bug description:
  == Comment: #0 - PAVITHRA R. PRAKASH <pavra...@in.ibm.com> - 2017-03-07 05:00:29 ==
  ---Problem Description---

  Ubuntu 17.04: dump is not captured in remote host when kdump over ssh
  is configured on firestone.

  ---Steps to Reproduce---

  1. Configure kdump.
  2. Check whether kdump is operational using "# kdump-config show".
  3. Install "kernel-debuginfo" and "kernel-debuginfo-common" rpms.
  4. Set up a passwordless ssh connection; generate an RSA key.
  # ssh-keygen -t rsa
  5. Verify that id_rsa and id_rsa.pub are created under /root/.ssh/
  6. Edit /etc/default/kdump-tools and add the entries below.
  SSH="ubuntu@9.114.15.239"
  SSH_KEY=/root/.ssh/id_rsa
  7. Propagate RSA key.
  # kdump-config propagate
  8. Restart kdump service.
  # kdump-config load
  9. Trigger a crash using the commands below.
  # echo "1" > /proc/sys/kernel/sysrq
  # echo "c" > /proc/sysrq-trigger
  10. Verify that the dump is available on the remote server in the configured path.
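
  Steps 5 and 6 can be sanity-checked before triggering the crash. The
  following is only a sketch: check_kdump_ssh_config is a hypothetical
  helper, not part of kdump-tools, and it assumes the defaults file is
  plain shell variable assignments that are safe to source.

```shell
# Hypothetical helper (not part of kdump-tools): verify that a kdump-tools
# defaults file sets SSH and SSH_KEY and that the key file exists.
check_kdump_ssh_config() {
    conf=$1
    [ -r "$conf" ] || { echo "cannot read $conf" >&2; return 1; }
    # Source the file in a subshell so the caller's environment is untouched.
    (
        SSH=
        SSH_KEY=
        . "$conf"
        [ -n "$SSH" ] || { echo "SSH is not set in $conf" >&2; exit 1; }
        [ -n "$SSH_KEY" ] || { echo "SSH_KEY is not set in $conf" >&2; exit 1; }
        [ -f "$SSH_KEY" ] || { echo "key file $SSH_KEY not found" >&2; exit 1; }
    )
}
```

  For example, "check_kdump_ssh_config /etc/default/kdump-tools" returns
  non-zero and prints the reason if either entry is missing.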

  Machine details
  ===========

  $ ipmitool -I lanplus -H  9.47.70.3 -U ADMIN -P admin sol activate

  $ ssh ubuntu@9.47.70.29

  PW: shriya101

  
  Attaching logs

  == Comment: #1 - PAVITHRA R. PRAKASH <pavra...@in.ibm.com> - 2017-03-07 05:01:42 ==

  
  == Comment: #5 - PAVITHRA R. PRAKASH <pavra...@in.ibm.com> - 2017-03-07 23:19:46 ==
  Hi, 

  Attaching the logs.

  Network info:

  root@ltc-firep3:~# hwinfo --network
  36: None 00.0: 10700 Loopback                                   
    [Created at net.126]
    Unique ID: ZsBS.GQNx7L4uPNA
    SysFS ID: /class/net/lo
    Hardware Class: network interface
    Model: "Loopback network interface"
    Device File: lo
    Link detected: yes
    Config Status: cfg=new, avail=yes, need=no, active=unknown

  37: None 00.0: 10701 Ethernet
    [Created at net.126]
    Unique ID: 2lHw.ndpeucax6V1
    Parent ID: mIXc.aXC4wIvegH8
    SysFS ID: /class/net/enP33p3s0f2
    SysFS Device Link: /devices/pci0021:00/0021:00:00.0/0021:01:00.0/0021:02:01.0/0021:03:00.2
    Hardware Class: network interface
    Model: "Ethernet network interface"
    Driver: "tg3"
    Driver Modules: "tg3"
    Device File: enP33p3s0f2
    HW Address: 98:be:94:03:18:4a
    Permanent HW Address: 98:be:94:03:18:4a
    Link detected: no
    Config Status: cfg=new, avail=yes, need=no, active=unknown
    Attached to: #15 (Ethernet controller)

  38: None 00.0: 10701 Ethernet
    [Created at net.126]
    Unique ID: 7Onn.ndpeucax6V1
    Parent ID: sx0U.aXC4wIvegH8
    SysFS ID: /class/net/enP33p3s0f0
    SysFS Device Link: /devices/pci0021:00/0021:00:00.0/0021:01:00.0/0021:02:01.0/0021:03:00.0
    Hardware Class: network interface
    Model: "Ethernet network interface"
    Driver: "tg3"
    Driver Modules: "tg3"
    Device File: enP33p3s0f0
    HW Address: 98:be:94:03:18:48
    Permanent HW Address: 98:be:94:03:18:48
    Link detected: yes
    Config Status: cfg=new, avail=yes, need=no, active=unknown
    Attached to: #16 (Ethernet controller)

  39: None 00.0: 10701 Ethernet
    [Created at net.126]
    Unique ID: VwX_.ndpeucax6V1
    Parent ID: DUng.aXC4wIvegH8
    SysFS ID: /class/net/enP33p3s0f3
    SysFS Device Link: /devices/pci0021:00/0021:00:00.0/0021:01:00.0/0021:02:01.0/0021:03:00.3
    Hardware Class: network interface
    Model: "Ethernet network interface"
    Driver: "tg3"
    Driver Modules: "tg3"
    Device File: enP33p3s0f3
    HW Address: 98:be:94:03:18:4b
    Permanent HW Address: 98:be:94:03:18:4b
    Link detected: no
    Config Status: cfg=new, avail=yes, need=no, active=unknown
    Attached to: #25 (Ethernet controller)

  40: None 00.0: 10701 Ethernet
    [Created at net.126]
    Unique ID: bZ1s.ndpeucax6V1
    Parent ID: J7HY.aXC4wIvegH8
    SysFS ID: /class/net/enP33p3s0f1
    SysFS Device Link: /devices/pci0021:00/0021:00:00.0/0021:01:00.0/0021:02:01.0/0021:03:00.1
    Hardware Class: network interface
    Model: "Ethernet network interface"
    Driver: "tg3"
    Driver Modules: "tg3"
    Device File: enP33p3s0f1
    HW Address: 98:be:94:03:18:49
    Permanent HW Address: 98:be:94:03:18:49
    Link detected: no
    Config Status: cfg=new, avail=yes, need=no, active=unknown
    Attached to: #4 (Ethernet controller)
  root@ltc-firep3:~# 


  Thanks,
  Pavithra

  == Comment: #6 - PAVITHRA R. PRAKASH <pavra...@in.ibm.com> - 2017-03-07 23:20:47 ==

  
  == Comment: #7 - PAVITHRA R. PRAKASH <pavra...@in.ibm.com> - 2017-03-07 23:21:27 ==

  
  == Comment: #8 - Urvashi Jawere <urjaw...@in.ibm.com> - 2017-03-08 02:48:15 ==
  I am able to see some errors in syslog:

  auxiliary
  Mar  7 04:57:44 ltc-firep3 systemd-resolved[3486]: DNSSEC validation failed for question 114.15.239:/home/ubuntu/test IN SOA: failed-auxiliary
  Mar  7 04:57:44 ltc-firep3 systemd-resolved[3486]: DNSSEC validation failed for question 9.114.15.239:/home/ubuntu/test IN DS: failed-auxiliary
  Mar  7 04:57:44 ltc-firep3 systemd-resolved[3486]: DNSSEC validation failed for question 9.114.15.239:/home/ubuntu/test IN SOA: failed-auxiliary
  Mar  7 04:57:44 ltc-firep3 systemd-resolved[3486]: DNSSEC validation failed for question 9.114.15.239:/home/ubuntu/test IN A: failed-auxiliary
  Mar  7 04:57:44 ltc-firep3 systemd-resolved[3486]: Server 9.12.16.2 does not support DNSSEC, downgrading to non-DNSSEC mode.
  Mar  7 04:57:44 ltc-firep3 kdump-config: /root/.ssh/id_rsa failed to be sent to ubuntu@9.114.15.239:/home/ubuntu/test
  Mar  7 04:58:04 ltc-firep3 systemd[1]: Reloading.
  Mar  7 04:59:15 ltc-firep3 systemd[1]: Reloading.
  Mar  7 04:59:16 ltc-firep3 kdump-config: propagated ssh key /root/.ssh/id_rsa to server ubuntu@9.114.15.239
  .
  .
  .

  Mar  7 05:06:55 ltc-firep3 systemd[1]: Started Accounts Service.
  Mar  7 05:06:56 ltc-firep3 kdump-tools[3498]: Starting kdump-tools: Modified cmdline:root=UUID=1e76cfd5-988c-46f4-bdc4-39fe1ed01152 ro quiet splash irqpoll nr_cpus=1 nousb systemd.unit=kdump-tools.service ata_piix.prefer_ms_hyperv=0 elfcorehdr=155136K
  Mar  7 05:06:57 ltc-firep3 kdump-tools[3498]:  * loaded kdump kernel
  Mar  7 05:06:57 ltc-firep3 kdump-tools: /sbin/kexec -p --command-line="root=UUID=1e76cfd5-988c-46f4-bdc4-39fe1ed01152 ro quiet splash irqpoll nr_cpus=1 nousb systemd.unit=kdump-tools.service ata_piix.prefer_ms_hyperv=0" --initrd=/var/lib/kdump/initrd.img /var/lib/kdump/vmlinuz
  Mar  7 05:06:57 ltc-firep3 kdump-tools: loaded kdump kernel
  Mar  7 05:06:57 ltc-firep3 systemd[1]: Started Kernel crash dump capture service.
  Mar  7 05:06:57 ltc-firep3 apport[3584]: ERROR: Cannot create report: [Errno 17] File exists: '/var/crash/linux-image-4.10.0-9-generic-201703060521.crash'
  Mar  7 05:06:57 ltc-firep3 apport[3584]:    ...done.

  == Comment: #18 - Hari Krishna Bathini <hbath...@in.ibm.com> - 2017-03-28 06:55:20 ==
  Looks like the tg3 module was not needed after all. Interestingly, though,
  even after enP34p1s0f0 is up (ifup) and network-online.target is reached,
  the network was not really active. It took about 30 seconds after reaching
  network-online.target for the network to become active, even on a normal boot.
  Adding this wait time to the kdump script, before saving the dump, ensured
  that the vmcore is captured successfully. Attaching the log for the same.

  Not sure why enP34p1s0f0 takes that long to configure/initialize. Even so,
  if this delay is inevitable, it should be part of ifup/network-online.target,
  so that the network is pingable once network-online.target is reached.
   
  Thanks
  Hari

  == Comment: #19 - Hari Krishna Bathini <hbath...@in.ibm.com> - 2017-03-28 07:01:52 ==
  The workaround snippet that adds a delay to the kdump script:

  
  --- kdump-config.orig 2017-03-28 03:35:17.753542107 -0500
  +++ kdump-config      2017-03-28 06:59:22.887576623 -0500
  @@ -761,6 +761,7 @@
        KDUMP_DMESGFILE="$KDUMP_STAMPDIR/dmesg.$KDUMP_STAMP"
        ERROR=0
   
  +     sleep 30
        ssh -i $KDUMP_SSH_KEY $KDUMP_REMOTE_HOST mkdir -p $KDUMP_STAMPDIR
        ERROR=$?
        # If remote connections fails, no need to continue

  ---

  Thanks
  Hari
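
  A fixed 'sleep 30' always costs the full 30 seconds and still fails if the
  link needs longer. As an alternative, the wait could be a bounded retry
  loop. This is only a sketch: wait_for_remote is a hypothetical helper that
  does not exist in kdump-config, and the 60 second budget in the usage note
  is an assumption, not a measured value.

```shell
# Hypothetical helper (not in kdump-config): retry a command until it
# succeeds or the time budget is used up.
# Usage: wait_for_remote MAX_SECONDS INTERVAL COMMAND [ARGS...]
# INTERVAL must be a positive integer, otherwise the budget never advances.
wait_for_remote() {
    max_wait=$1
    interval=$2
    shift 2
    waited=0
    # Run the probe command; stop as soon as it succeeds.
    while ! "$@" >/dev/null 2>&1; do
        if [ "$waited" -ge "$max_wait" ]; then
            return 1    # budget exhausted, remote still unreachable
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 0
}
```

  In place of the 'sleep 30' line, the script could then call something like
  'wait_for_remote 60 2 ssh -i $KDUMP_SSH_KEY -o BatchMode=yes -o ConnectTimeout=5 $KDUMP_REMOTE_HOST true'
  and continue as soon as the remote host answers.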

  == Comment: #20 - PAVITHRA R. PRAKASH <pavra...@in.ibm.com> - 2017-03-30 01:33:56 ==
  (In reply to comment #19)
  > The workaround snippet adding delay in kdump script:
  > 
  > 
  > --- kdump-config.orig       2017-03-28 03:35:17.753542107 -0500
  > +++ kdump-config    2017-03-28 06:59:22.887576623 -0500
  > @@ -761,6 +761,7 @@
  >     KDUMP_DMESGFILE="$KDUMP_STAMPDIR/dmesg.$KDUMP_STAMP"
  >     ERROR=0
  >  
  > +   sleep 30
  >     ssh -i $KDUMP_SSH_KEY $KDUMP_REMOTE_HOST mkdir -p $KDUMP_STAMPDIR
  >     ERROR=$?
  >     # If remote connections fails, no need to continue
  > 
  > ---
  > 
  > Thanks
  > Hari

  With the above workaround, the dump was captured successfully on the remote host.

  Thanks,
  Pavithra

  == Comment: #22 - Hari Krishna Bathini <hbath...@in.ibm.com> - 2017-04-10 22:14:27 ==
  (In reply to comment #18)
  > Created attachment 117088 [details]
  > Console log of successful dump capture after adding a time delay of 'sleep
  > 30'
  > 
  > Looks like tg3 module was not needed after all. Interesting thing though is
  > even after enP34p1s0f0 is up (ifup) and network.online target is reached,
  > network was not really active. It took about 30 seconds, after reaching 
  > network.online target, for the network to be active, even on a normal boot.
  > Adding this wait time in kdump script, before saving dump, ensured that
  > vmcore is captured successful. Attaching the log for the same..
  > 
  > Not sure why enP34p1s0f0 is taking that long to configure/initialize. Even
  > so,
  > this delay should be part of ifup/network-online.target if it is inevitable,
  > so that network is pingable after network-online.target

  Hi Canonical,

  Since this falls outside the realm of kdump, should we add a NET_WAIT_TIME
  field to the /etc/default/kdump-tools file that defaults to 0 but can be
  changed when the user sees timing troubles?

  Thanks
  Hari
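
  To make the proposal concrete, here is a rough sketch of how the script
  might read such a field. NET_WAIT_TIME is only the name proposed above and
  is not an existing kdump-tools setting; net_wait_seconds is a hypothetical
  helper, and falling back to 0 on a non-numeric value is an assumption.

```shell
# Hypothetical helper: print the number of seconds to wait before the dump
# is sent over the network. NET_WAIT_TIME is the proposed (not yet existing)
# field from /etc/default/kdump-tools.
net_wait_seconds() {
    value=${NET_WAIT_TIME:-0}       # unset or empty means no wait
    case "$value" in
        *[!0-9]*) value=0 ;;        # anything non-numeric falls back to 0
    esac
    echo "$value"
}
```

  The dump path could then run 'sleep "$(net_wait_seconds)"' where the
  workaround currently has 'sleep 30', so the default behaviour stays
  unchanged unless the user sets the field.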

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-power-systems/+bug/1681909/+subscriptions
