------- Comment From mu...@br.ibm.com 2018-03-06 09:38 EDT-------
Cascardo, I gave kdump a try on Ubuntu 16.04 and it seems to fail
occasionally.

The failures seem to be random. I see that we are hitting an EEH in
slots behind a PLX switch, but we hit the EEH in successful attempts as
well, so I am not sure how much it is related. I collected the console
log of a failed attempt (I am attaching it).

I am attempting to drop into a shell by setting sh or bash to run in the
KDUMP_FAIL_CMD option, but it seems to just hang and not give me a
console. I want to collect more logs and see whether the adapter is able
to reach the peer, to check whether this is possibly a timing issue. Is
there a proper way to drop to a shell in case of a kdump failure?

Also, I am attempting to reinstall Ubuntu 18.04 to reattempt kdump a few
more times to make sure I didn't just get lucky, but I am hitting IBM Bug
165336 - Canonical LP 1753449, which is preventing me from reinstalling
Ubuntu 18.04.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to makedumpfile in Ubuntu.
https://bugs.launchpad.net/bugs/1681909

Title:
  dump is not captured in remote host when kdump over ssh is configured
  on firestone.

Status in The Ubuntu-power-systems project:
  Incomplete
Status in makedumpfile package in Ubuntu:
  New

Bug description:
  == Comment: #0 - PAVITHRA R. PRAKASH <pavra...@in.ibm.com> - 2017-03-07 05:00:29 ==
  ---Problem Description---

  Ubuntu 17.04: dump is not captured in remote host when kdump over ssh
  is configured on firestone.

  ---Steps to Reproduce---

  1. Configure kdump.
  2. Check whether kdump is operational using "# kdump-config show".
  3. Install "kernel-debuginfo" and "kernel-debuginfo-common" rpms.
  4. Set up a passwordless ssh connection; generate an RSA key.
  # ssh-keygen -t rsa
  5. Verify that id_rsa and id_rsa.pub are created under /root/.ssh/
  6. Edit /etc/default/kdump-tools and add the entries below.
  SSH="ubuntu@9.114.15.239"
  SSH_KEY=/root/.ssh/id_rsa
  7. Propagate RSA key.
  # kdump-config propagate
  8. Restart kdump service.
  # kdump-config load
  9. Trigger a crash using the commands below.
  # echo "1" > /proc/sys/kernel/sysrq
  # echo "c" > /proc/sysrq-trigger
  10. Verify that the dump is available on the remote server in the configured path.
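
A pre-flight check along these lines may help rule out a configuration problem before triggering the crash in step 9. This is only a sketch: check_kdump_ssh_config is a hypothetical helper, not part of kdump-tools, and it assumes that both SSH and SSH_KEY are set explicitly in the file, as in step 6.

```shell
# Hypothetical pre-flight check; not part of kdump-tools.
# Returns 0 when the file defines SSH and SSH_KEY and the key file is readable.
check_kdump_ssh_config() {
    conf=${1:-/etc/default/kdump-tools}
    if [ ! -r "$conf" ]; then
        echo "cannot read $conf" >&2
        return 2
    fi
    # Source the file in a subshell so its settings do not leak into the caller.
    (
        SSH=
        SSH_KEY=
        . "$conf"
        if [ -z "$SSH" ]; then
            echo "SSH is not set in $conf" >&2
            exit 1
        fi
        if [ -z "$SSH_KEY" ]; then
            echo "SSH_KEY is not set in $conf" >&2
            exit 1
        fi
        if [ ! -r "$SSH_KEY" ]; then
            echo "key file $SSH_KEY is missing or unreadable" >&2
            exit 1
        fi
    )
}

# Example (run as root on the crashing host):
#   check_kdump_ssh_config && echo "kdump ssh settings look sane"
```

The check only looks at the local settings; it does not confirm that the remote host accepts the key.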

  Machine details
  ===========

  $ ipmitool -I lanplus -H  9.47.70.3 -U ADMIN -P admin sol activate

  $ ssh ubuntu@9.47.70.29

  PW: shriya101

  
  Attaching logs

  == Comment: #1 - PAVITHRA R. PRAKASH <pavra...@in.ibm.com> -
  2017-03-07 05:01:42 ==

  
  == Comment: #5 - PAVITHRA R. PRAKASH <pavra...@in.ibm.com> - 2017-03-07 23:19:46 ==
  Hi, 

  Attaching the logs.

  Network info:

  root@ltc-firep3:~# hwinfo --network
  36: None 00.0: 10700 Loopback                                   
    [Created at net.126]
    Unique ID: ZsBS.GQNx7L4uPNA
    SysFS ID: /class/net/lo
    Hardware Class: network interface
    Model: "Loopback network interface"
    Device File: lo
    Link detected: yes
    Config Status: cfg=new, avail=yes, need=no, active=unknown

  37: None 00.0: 10701 Ethernet
    [Created at net.126]
    Unique ID: 2lHw.ndpeucax6V1
    Parent ID: mIXc.aXC4wIvegH8
    SysFS ID: /class/net/enP33p3s0f2
    SysFS Device Link: /devices/pci0021:00/0021:00:00.0/0021:01:00.0/0021:02:01.0/0021:03:00.2
    Hardware Class: network interface
    Model: "Ethernet network interface"
    Driver: "tg3"
    Driver Modules: "tg3"
    Device File: enP33p3s0f2
    HW Address: 98:be:94:03:18:4a
    Permanent HW Address: 98:be:94:03:18:4a
    Link detected: no
    Config Status: cfg=new, avail=yes, need=no, active=unknown
    Attached to: #15 (Ethernet controller)

  38: None 00.0: 10701 Ethernet
    [Created at net.126]
    Unique ID: 7Onn.ndpeucax6V1
    Parent ID: sx0U.aXC4wIvegH8
    SysFS ID: /class/net/enP33p3s0f0
    SysFS Device Link: /devices/pci0021:00/0021:00:00.0/0021:01:00.0/0021:02:01.0/0021:03:00.0
    Hardware Class: network interface
    Model: "Ethernet network interface"
    Driver: "tg3"
    Driver Modules: "tg3"
    Device File: enP33p3s0f0
    HW Address: 98:be:94:03:18:48
    Permanent HW Address: 98:be:94:03:18:48
    Link detected: yes
    Config Status: cfg=new, avail=yes, need=no, active=unknown
    Attached to: #16 (Ethernet controller)

  39: None 00.0: 10701 Ethernet
    [Created at net.126]
    Unique ID: VwX_.ndpeucax6V1
    Parent ID: DUng.aXC4wIvegH8
    SysFS ID: /class/net/enP33p3s0f3
    SysFS Device Link: /devices/pci0021:00/0021:00:00.0/0021:01:00.0/0021:02:01.0/0021:03:00.3
    Hardware Class: network interface
    Model: "Ethernet network interface"
    Driver: "tg3"
    Driver Modules: "tg3"
    Device File: enP33p3s0f3
    HW Address: 98:be:94:03:18:4b
    Permanent HW Address: 98:be:94:03:18:4b
    Link detected: no
    Config Status: cfg=new, avail=yes, need=no, active=unknown
    Attached to: #25 (Ethernet controller)

  40: None 00.0: 10701 Ethernet
    [Created at net.126]
    Unique ID: bZ1s.ndpeucax6V1
    Parent ID: J7HY.aXC4wIvegH8
    SysFS ID: /class/net/enP33p3s0f1
    SysFS Device Link: /devices/pci0021:00/0021:00:00.0/0021:01:00.0/0021:02:01.0/0021:03:00.1
    Hardware Class: network interface
    Model: "Ethernet network interface"
    Driver: "tg3"
    Driver Modules: "tg3"
    Device File: enP33p3s0f1
    HW Address: 98:be:94:03:18:49
    Permanent HW Address: 98:be:94:03:18:49
    Link detected: no
    Config Status: cfg=new, avail=yes, need=no, active=unknown
    Attached to: #4 (Ethernet controller)
  root@ltc-firep3:~# 


  Thanks,
  Pavithra

  == Comment: #6 - PAVITHRA R. PRAKASH <pavra...@in.ibm.com> -
  2017-03-07 23:20:47 ==

  
  == Comment: #7 - PAVITHRA R. PRAKASH <pavra...@in.ibm.com> - 2017-03-07 
23:21:27 ==

  
  == Comment: #8 - Urvashi Jawere <urjaw...@in.ibm.com> - 2017-03-08 02:48:15 ==
  I am able to see some errors in syslog:

  auxiliary
  Mar  7 04:57:44 ltc-firep3 systemd-resolved[3486]: DNSSEC validation failed for question 114.15.239:/home/ubuntu/test IN SOA: failed-auxiliary
  Mar  7 04:57:44 ltc-firep3 systemd-resolved[3486]: DNSSEC validation failed for question 9.114.15.239:/home/ubuntu/test IN DS: failed-auxiliary
  Mar  7 04:57:44 ltc-firep3 systemd-resolved[3486]: DNSSEC validation failed for question 9.114.15.239:/home/ubuntu/test IN SOA: failed-auxiliary
  Mar  7 04:57:44 ltc-firep3 systemd-resolved[3486]: DNSSEC validation failed for question 9.114.15.239:/home/ubuntu/test IN A: failed-auxiliary
  Mar  7 04:57:44 ltc-firep3 systemd-resolved[3486]: Server 9.12.16.2 does not support DNSSEC, downgrading to non-DNSSEC mode.
  Mar  7 04:57:44 ltc-firep3 kdump-config: /root/.ssh/id_rsa failed to be sent to ubuntu@9.114.15.239:/home/ubuntu/test
  Mar  7 04:58:04 ltc-firep3 systemd[1]: Reloading.
  Mar  7 04:59:15 ltc-firep3 systemd[1]: Reloading.
  Mar  7 04:59:16 ltc-firep3 kdump-config: propagated ssh key /root/.ssh/id_rsa to server ubuntu@9.114.15.239
  .
  .
  .

  Mar  7 05:06:55 ltc-firep3 systemd[1]: Started Accounts Service.
  Mar  7 05:06:56 ltc-firep3 kdump-tools[3498]: Starting kdump-tools: Modified cmdline:root=UUID=1e76cfd5-988c-46f4-bdc4-39fe1ed01152 ro quiet splash irqpoll nr_cpus=1 nousb systemd.unit=kdump-tools.service ata_piix.prefer_ms_hyperv=0 elfcorehdr=155136K
  Mar  7 05:06:57 ltc-firep3 kdump-tools[3498]:  * loaded kdump kernel
  Mar  7 05:06:57 ltc-firep3 kdump-tools: /sbin/kexec -p --command-line="root=UUID=1e76cfd5-988c-46f4-bdc4-39fe1ed01152 ro quiet splash irqpoll nr_cpus=1 nousb systemd.unit=kdump-tools.service ata_piix.prefer_ms_hyperv=0" --initrd=/var/lib/kdump/initrd.img /var/lib/kdump/vmlinuz
  Mar  7 05:06:57 ltc-firep3 kdump-tools: loaded kdump kernel
  Mar  7 05:06:57 ltc-firep3 systemd[1]: Started Kernel crash dump capture service.
  Mar  7 05:06:57 ltc-firep3 apport[3584]: ERROR: Cannot create report: [Errno 17] File exists: '/var/crash/linux-image-4.10.0-9-generic-201703060521.crash'
  Mar  7 05:06:57 ltc-firep3 apport[3584]:    ...done.

  == Comment: #18 - Hari Krishna Bathini <hbath...@in.ibm.com> - 2017-03-28 06:55:20 ==
  Looks like the tg3 module was not needed after all. The interesting thing,
  though, is that even after enP34p1s0f0 is up (ifup) and network-online.target
  is reached, the network was not really active. It took about 30 seconds after
  reaching network-online.target for the network to become active, even on a
  normal boot. Adding this wait time in the kdump script, before saving the
  dump, ensured that the vmcore is captured successfully. Attaching the log for
  the same.

  Not sure why enP34p1s0f0 is taking that long to configure/initialize. Even so,
  this delay should be part of ifup/network-online.target if it is inevitable,
  so that the network is pingable after network-online.target.
   
  Thanks
  Hari

  == Comment: #19 - Hari Krishna Bathini <hbath...@in.ibm.com> - 2017-03-28 07:01:52 ==
  The workaround snippet adding a delay in the kdump script:

  
  --- kdump-config.orig 2017-03-28 03:35:17.753542107 -0500
  +++ kdump-config      2017-03-28 06:59:22.887576623 -0500
  @@ -761,6 +761,7 @@
        KDUMP_DMESGFILE="$KDUMP_STAMPDIR/dmesg.$KDUMP_STAMP"
        ERROR=0
   
  +     sleep 30
        ssh -i $KDUMP_SSH_KEY $KDUMP_REMOTE_HOST mkdir -p $KDUMP_STAMPDIR
        ERROR=$?
        # If remote connections fails, no need to continue

  ---
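
As a possible alternative to a fixed 'sleep 30', the script could poll until the remote host answers and give up after a timeout. A minimal sketch, assuming a POSIX shell: wait_for_probe is a hypothetical helper that is not in kdump-config, and the ssh probe in the usage comment is only one possible choice of probe.

```shell
# Hypothetical helper; not part of kdump-config.
# Usage: wait_for_probe TIMEOUT_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND once per second until it succeeds (returns 0) or until
# TIMEOUT_SECONDS one-second sleeps have passed (returns 1). Time spent
# inside COMMAND itself is not counted against the timeout.
wait_for_probe() {
    timeout_s=$1
    shift
    elapsed=0
    until "$@"; do
        if [ "$elapsed" -ge "$timeout_s" ]; then
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}

# Possible use in place of the fixed sleep, before the first ssh call:
#   wait_for_probe 60 ssh -i "$KDUMP_SSH_KEY" -o BatchMode=yes \
#       -o ConnectTimeout=5 "$KDUMP_REMOTE_HOST" true
```

On hosts where the network is already up, this returns as soon as the first probe succeeds, so it would not add 30 seconds to every dump.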

  Thanks
  Hari

  == Comment: #20 - PAVITHRA R. PRAKASH <pavra...@in.ibm.com> - 2017-03-30 01:33:56 ==
  (In reply to comment #19)
  > The workaround snippet adding delay in kdump script:
  > 
  > 
  > --- kdump-config.orig       2017-03-28 03:35:17.753542107 -0500
  > +++ kdump-config    2017-03-28 06:59:22.887576623 -0500
  > @@ -761,6 +761,7 @@
  >     KDUMP_DMESGFILE="$KDUMP_STAMPDIR/dmesg.$KDUMP_STAMP"
  >     ERROR=0
  >  
  > +   sleep 30
  >     ssh -i $KDUMP_SSH_KEY $KDUMP_REMOTE_HOST mkdir -p $KDUMP_STAMPDIR
  >     ERROR=$?
  >     # If remote connections fails, no need to continue
  > 
  > ---
  > 
  > Thanks
  > Hari

  With the above workaround, the dump was captured successfully on the remote host.

  Thanks,
  Pavithra

  == Comment: #22 - Hari Krishna Bathini <hbath...@in.ibm.com> - 2017-04-10 22:14:27 ==
  (In reply to comment #18)
  > Created attachment 117088 [details]
  > Console log of successful dump capture after adding a time delay of 'sleep
  > 30'
  > 
  > Looks like tg3 module was not needed after all. Interesting thing though is
  > even after enP34p1s0f0 is up (ifup) and network.online target is reached,
  > network was not really active. It took about 30 seconds, after reaching 
  > network.online target, for the network to be active, even on a normal boot.
  > Adding this wait time in kdump script, before saving dump, ensured that
  > vmcore is captured successful. Attaching the log for the same..
  > 
  > Not sure why enP34p1s0f0 is taking that long to configure/initialize. Even
  > so, this delay should be part of ifup/network-online.target if it is
  > inevitable, so that network is pingable after network-online.target

  Hi Canonical,

  Since this falls outside the realm of kdump, should we add a NET_WAIT_TIME
  field in the /etc/default/kdump-tools file that defaults to 0 but can be
  changed when the user sees timing troubles?
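
To make the proposal concrete, here is a rough sketch of how such a setting might be read. NET_WAIT_TIME is only the name proposed above and is not an existing kdump-tools option; net_wait_seconds is a hypothetical helper, and treating a non-numeric value as 0 is an assumption about the desired behaviour.

```shell
# Hypothetical: NET_WAIT_TIME is the proposed setting, not an existing one.
# Prints the number of seconds to wait before saving the dump over the network.
# Unset, empty, or non-numeric values fall back to 0 (no wait).
net_wait_seconds() {
    case ${NET_WAIT_TIME:-0} in
        *[!0-9]*) echo 0 ;;
        *) echo "${NET_WAIT_TIME:-0}" ;;
    esac
}

# Possible use in the dump path, before the first ssh call:
#   sleep "$(net_wait_seconds)"
```

With the default of 0 the behaviour stays as it is today, and a user who sees the timing problem could set NET_WAIT_TIME=30 to get the effect of the workaround.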

  Thanks
  Hari
