------- Comment From lagar...@br.ibm.com 2018-05-16 10:11 EDT-------
Patch set was resent to Canonical mailing list by Ziviani.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1759723

Title:
  ISST-LTE:KVM:Ubuntu18.04:BostonLC:boslcp3:boslcp3g3:Guest console
  hangs after hotplug CPU add operation.

Status in The Ubuntu-power-systems project:
  Fix Committed
Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Bionic:
  In Progress

Bug description:
  Problem Description:
  ===================
  Performed a CPU hotplug attach operation on the guest, after which the
  guest console became unresponsive.

  Steps to re-create:
  ==================
  1. Updated the boslcp3 host to BMC: 116 & PNOR: 20180302 levels
   
  2. Installed Ubuntu1804 on boslcp3 host & guests with trap issue fixes

  root@boslcp3:/home# uname -a
  Linux boslcp3 4.15.0-12-generic #13+leo20180320 SMP Tue Mar 20 13:10:42 CDT 2018 ppc64le ppc64le ppc64le GNU/Linux
  root@boslcp3:/home# uname -r
  4.15.0-12-generic

  root@boslcp3g3:/kte/tools/setup.d# uname -a
  Linux boslcp3g3 4.15.0-12-generic #13+leo20180320 SMP Tue Mar 20 13:10:42 CDT 2018 ppc64le ppc64le ppc64le GNU/Linux
  root@boslcp3g3:/kte/tools/setup.d# uname -r
  4.15.0-12-generic

  3. Started HTX & stress-ng on the guest and ran them for 10-15 min

  4. Cleaned up the tests before the hot-plug and ensured enough memory
  and CPU were available (killed all the processes using kill)

  5. Performed CPU hot-plug, and the guest went into a hung state

  Before Hotplug:

  root@boslcp3:~# virsh dumpxml boslcp3g3 | grep vcpu
    <vcpu placemen

  Hotplug add CPU:

  root@boslcp3:~# virsh setvcpus boslcp3g3 48 --live

  dumpxml:

  root@boslcp3:~# virsh dumpxml boslcp3g3 | grep cpu
    <vcpu placement='static' current='48'>64</vcpu>
    <vcpus>
      <vcpu id='0' enabled='yes' hotpluggable='no' order='1'/>
      <vcpu id='1' enabled='yes' hotpluggable='no' order='1'/>
      <vcpu id='2' enabled='yes' hotpluggable='no' order='1'/>
      <vcpu id='3' enabled='yes' hotpluggable='no' order='1'/>
      <vcpu id='4' enabled='yes' hotpluggable='no' order='2'/>
      <vcpu id='5' enabled='yes' hotpluggable='no' order='2'/>
      <vcpu id='6' enabled='yes' hotpluggable='no' order='2'/>
      <vcpu id='7' enabled='yes' hotpluggable='no' order='2'/>
      <vcpu id='8' enabled='yes' hotpluggable='no' order='3'/>
      <vcpu id='9' enabled='yes' hotpluggable='no' order='3'/>
      <vcpu id='10' enabled='yes' hotpluggable='no' order='3'/>
      <vcpu id='11' enabled='yes' hotpluggable='no' order='3'/>
      <vcpu id='12' enabled='yes' hotpluggable='no' order='4'/>
      <vcpu id='13' enabled='yes' hotpluggable='no' order='4'/>
      <vcpu id='14' enabled='yes' hotpluggable='no' order='4'/>
      <vcpu id='15' enabled='yes' hotpluggable='no' order='4'/>
      <vcpu id='16' enabled='yes' hotpluggable='no' order='5'/>
      <vcpu id='17' enabled='yes' hotpluggable='no' order='5'/>
      <vcpu id='18' enabled='yes' hotpluggable='no' order='5'/>
      <vcpu id='19' enabled='yes' hotpluggable='no' order='5'/>
      <vcpu id='20' enabled='yes' hotpluggable='no' order='6'/>
      <vcpu id='21' enabled='yes' hotpluggable='no' order='6'/>
      <vcpu id='22' enabled='yes' hotpluggable='no' order='6'/>
      <vcpu id='23' enabled='yes' hotpluggable='no' order='6'/>
      <vcpu id='24' enabled='yes' hotpluggable='no' order='7'/>
      <vcpu id='25' enabled='yes' hotpluggable='no' order='7'/>
      <vcpu id='26' enabled='yes' hotpluggable='no' order='7'/>
      <vcpu id='27' enabled='yes' hotpluggable='no' order='7'/>
      <vcpu id='28' enabled='yes' hotpluggable='no' order='8'/>
      <vcpu id='29' enabled='yes' hotpluggable='no' order='8'/>
      <vcpu id='30' enabled='yes' hotpluggable='no' order='8'/>
      <vcpu id='31' enabled='yes' hotpluggable='no' order='8'/>
      <vcpu id='32' enabled='yes' hotpluggable='yes' order='9'/>
      <vcpu id='33' enabled='yes' hotpluggable='yes' order='9'/>
      <vcpu id='34' enabled='yes' hotpluggable='yes' order='9'/>
      <vcpu id='35' enabled='yes' hotpluggable='yes' order='9'/>
      <vcpu id='36' enabled='yes' hotpluggable='yes' order='10'/>
      <vcpu id='37' enabled='yes' hotpluggable='yes' order='10'/>
      <vcpu id='38' enabled='yes' hotpluggable='yes' order='10'/>
      <vcpu id='39' enabled='yes' hotpluggable='yes' order='10'/>
      <vcpu id='40' enabled='yes' hotpluggable='yes' order='11'/>
      <vcpu id='41' enabled='yes' hotpluggable='yes' order='11'/>
      <vcpu id='42' enabled='yes' hotpluggable='yes' order='11'/>
      <vcpu id='43' enabled='yes' hotpluggable='yes' order='11'/>
      <vcpu id='44' enabled='yes' hotpluggable='yes' order='12'/>
      <vcpu id='45' enabled='yes' hotpluggable='yes' order='12'/>
      <vcpu id='46' enabled='yes' hotpluggable='yes' order='12'/>
      <vcpu id='47' enabled='yes' hotpluggable='yes' order='12'/>
      <vcpu id='48' enabled='no' hotpluggable='yes'/>
      <vcpu id='49' enabled='no' hotpluggable='yes'/>
      <vcpu id='50' enabled='no' hotpluggable='yes'/>
      <vcpu id='51' enabled='no' hotpluggable='yes'/>
      <vcpu id='52' enabled='no' hotpluggable='yes'/>
      <vcpu id='53' enabled='no' hotpluggable='yes'/>
      <vcpu id='54' enabled='no' hotpluggable='yes'/>
      <vcpu id='55' enabled='no' hotpluggable='yes'/>
      <vcpu id='56' enabled='no' hotpluggable='yes'/>
      <vcpu id='57' enabled='no' hotpluggable='yes'/>
      <vcpu id='58' enabled='no' hotpluggable='yes'/>
      <vcpu id='59' enabled='no' hotpluggable='yes'/>
      <vcpu id='60' enabled='no' hotpluggable='yes'/>
      <vcpu id='61' enabled='no' hotpluggable='yes'/>
      <vcpu id='62' enabled='no' hotpluggable='yes'/>
      <vcpu id='63' enabled='no' hotpluggable='yes'/>
    </vcpus>
    <cpu mode='host-model' check='partial'>
    </cpu>
  root@boslcp3:~#  
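
The per-vCPU state in a dump like the one above can be summarized mechanically. The following is a minimal sketch, assuming the `<vcpus>` element has been extracted on its own as well-formed XML (the raw `grep` output is not a single XML document). The function name `summarize_vcpus` and the shortened sample are hypothetical and are not part of virsh or libvirt.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical sample shaped like the <vcpus> element in the dump.
SAMPLE_VCPUS = """
<vcpus>
  <vcpu id='0' enabled='yes' hotpluggable='no' order='1'/>
  <vcpu id='32' enabled='yes' hotpluggable='yes' order='9'/>
  <vcpu id='48' enabled='no' hotpluggable='yes'/>
</vcpus>
"""


def summarize_vcpus(vcpus_xml):
    """Group vCPU ids by state from a single well-formed <vcpus> element."""
    root = ET.fromstring(vcpus_xml.strip())
    summary = {"enabled": [], "enabled_hotpluggable": [], "disabled": []}
    for vcpu in root.iter("vcpu"):
        cpu_id = int(vcpu.get("id"))
        if vcpu.get("enabled") == "yes":
            summary["enabled"].append(cpu_id)
            # Enabled vCPUs marked hotpluggable are the candidates for
            # having been added (or being removable) at runtime.
            if vcpu.get("hotpluggable") == "yes":
                summary["enabled_hotpluggable"].append(cpu_id)
        else:
            summary["disabled"].append(cpu_id)
    return summary


if __name__ == "__main__":
    print(summarize_vcpus(SAMPLE_VCPUS))
```

Applied to the full element from the dump, the same grouping would separate ids 0-47 (enabled) from ids 48-63 (not enabled), with ids 32-47 being the enabled hotpluggable ones.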

  6. After this operation, the guest becomes unresponsive, as shown below

  
  root@boslcp3g3:~# [ 3626.140773] INFO: task jbd2/vda2-8:584 blocked for more than 120 seconds.
  [ 3626.146375]       Tainted: G        W        4.15.0-12-generic #13+leo20180320
  [ 3626.146457] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [ 3626.146624] INFO: task systemd-journal:665 blocked for more than 120 seconds.
  [ 3626.146699]       Tainted: G        W        4.15.0-12-generic #13+leo20180320
  [ 3626.146768] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [ 3626.146939] INFO: task rs:main Q:Reg:1995 blocked for more than 120 seconds.
  [ 3626.147016]       Tainted: G        W        4.15.0-12-generic #13+leo20180320
  [ 3626.147088] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [ 3626.147285] INFO: task kworker/u128:2:57691 blocked for more than 120 seconds.
  [ 3626.147361]       Tainted: G        W        4.15.0-12-generic #13+leo20180320
  [ 3626.147434] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [ 3626.147622] INFO: task smbd:1449 blocked for more than 120 seconds.
  [ 3626.147686]       Tainted: G        W        4.15.0-12-generic #13+leo20180320
  [ 3626.147760] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [ 3626.147875] INFO: task smbd:1452 blocked for more than 120 seconds.
  [ 3626.147937]       Tainted: G        W        4.15.0-12-generic #13+leo20180320
  [ 3626.148010] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [ 3626.148110] INFO: task smbd:1454 blocked for more than 120 seconds.
  [ 3626.148173]       Tainted: G        W        4.15.0-12-generic #13+leo20180320
  [ 3626.148245] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [ 3626.148344] INFO: task cron:1461 blocked for more than 120 seconds.
  [ 3626.148406]       Tainted: G        W        4.15.0-12-generic #13+leo20180320
  [ 3626.148488] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

  root@boslcp3g3:~#
  root@boslcp3g3:~# ps -ef | grep stress-ng
  [ 3746.978098] INFO: task jbd2/vda2-8:584 blocked for more than 120 seconds.
  [ 3746.978221]       Tainted: G        W        4.15.0-12-generic #13+leo20180320
  [ 3746.978301] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [ 3746.978447] INFO: task systemd-journal:665 blocked for more than 120 seconds.
  [ 3746.978534]       Tainted: G        W        4.15.0-12-generic #13+leo20180320
  [ 3746.978607] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [ 4446.361899] systemd[1]: Failed to start Journal Service.
  [ 4897.632142] systemd[1]: Failed to start Journal Service.
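
To see at a glance which tasks the hung-task detector reported, the console output can be filtered for the "blocked for more than N seconds" lines. This is only a sketch: the helper name `hung_tasks` is hypothetical, and the whitespace collapsing is an assumption made to cope with log lines that a mail client has wrapped.

```python
import re

# Matches kernel hung-task reports such as:
# [ 3626.140773] INFO: task jbd2/vda2-8:584 blocked for more than 120 seconds.
HUNG_TASK_RE = re.compile(
    r"\[\s*(\d+\.\d+)\] INFO: task (.+?):(\d+) blocked for more than (\d+) seconds"
)

# Sample lines in the same shape as the console output, including wrapped ones.
SAMPLE_LOG = """\
[ 3626.140773] INFO: task jbd2/vda2-8:584 blocked for more
than 120 seconds.
[ 3626.146939] INFO: task rs:main Q:Reg:1995 blocked for more than 120
seconds.
[ 3626.147285] INFO: task kworker/u128:2:57691 blocked for more than 120 seconds.
"""


def hung_tasks(log_text):
    """Return (timestamp, task name, pid, timeout) for each hung-task report."""
    # Collapse every whitespace run, including line breaks, so that wrapped
    # messages are matched as if they were on one line.
    flat = " ".join(log_text.split())
    return [
        (float(ts), name, int(pid), int(timeout))
        for ts, name, pid, timeout in HUNG_TASK_RE.findall(flat)
    ]


if __name__ == "__main__":
    for report in hung_tasks(SAMPLE_LOG):
        print(report)
```

The task name is matched lazily up to the last colon before the pid, so names that themselves contain colons (for example `rs:main Q:Reg` or `kworker/u128:2`) are kept whole.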


  ^Z
  ^X
  ^C
  ^Z
  ^X
  ^C

  7. Ping to boslcp3g3 works, but the guest console is not responding

  [ipjoga@kte (AUS) ~]$ ping boslcp3g3
  PING boslcp3g3.isst.aus.stglabs.ibm.com (10.33.12.73) 56(84) bytes of data.
  64 bytes from boslcp3g3.isst.aus.stglabs.ibm.com (10.33.12.73): icmp_seq=1 ttl=64 time=0.182 ms
  64 bytes from boslcp3g3.isst.aus.stglabs.ibm.com (10.33.12.73): icmp_seq=2 ttl=64 time=0.196 ms
  ^C

  
  8. Took a dump of the guest; attached the vmcore & other logs.

  Thanks to the Linux block community, I'm now aware of two commits that
  should fix this issue.

  
https://github.com/torvalds/linux/commit/20e4d813931961fe26d26a1e98b3aba6ec00b130

   blk-mq: simplify queue mapping & schedule with each possisble CPU

   The previous patch assigns interrupt vectors to all possible CPUs, so
   now hctx can be mapped to possible CPUs, this patch applies this fact
   to simplify queue mapping & schedule so that we don't need to handle
   CPU hotplug for dealing with physical CPU plug & unplug. With this
   simplication, we can work well on physical CPU plug & unplug, which
   is a normal use case for VM at least.

   Make sure we allocate blk_mq_ctx structures for all possible CPUs, and
   set hctx->numa_node for possible CPUs which are mapped to this hctx. And
   only choose the online CPUs for schedule.

  
https://github.com/torvalds/linux/commit/84676c1f21e8ff54befe985f4f14dc1edc10046b

   genirq/affinity: assign vectors to all possible CPUs

   Currently we assign managed interrupt vectors to all present CPUs.  This
   works fine for systems were we only online/offline CPUs.  But in case of
   systems that support physical CPU hotplug (or the virtualized version of
   it) this means the additional CPUs covered for in the ACPI tables or on
   the command line are not catered for.  To fix this we'd either need to
   introduce new hotplug CPU states just for this case, or we can start
   assining vectors to possible but not present CPUs.
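
The difference the two commit messages describe, mapping only present CPUs versus mapping all possible CPUs, can be illustrated with a toy model. This is a sketch under stated assumptions, not the kernel's algorithm: the round-robin assignment, the helper names `map_cpus_to_queues` and `unmapped_cpus`, and the queue count of 4 are all hypothetical. The CPU counts (32 at boot, 48 after hotplug, 64 maximum) follow the dumpxml output in this report.

```python
def map_cpus_to_queues(cpus, nr_queues):
    """Toy round-robin mapping of CPU ids to hardware queue indexes."""
    return {cpu: index % nr_queues for index, cpu in enumerate(sorted(cpus))}


def unmapped_cpus(mapping, online_cpus):
    """Return the online CPUs that have no queue in the given mapping."""
    return sorted(cpu for cpu in online_cpus if cpu not in mapping)


# CPU sets modeled on this report: 32 vCPUs at boot, 64 possible,
# 48 online after the hotplug. nr_queues=4 is an arbitrary assumption.
PRESENT_AT_BOOT = set(range(32))
POSSIBLE = set(range(64))
ONLINE_AFTER_HOTPLUG = set(range(48))

# Mapping built once from the CPUs present at boot: hot-added CPUs are missing.
present_only_map = map_cpus_to_queues(PRESENT_AT_BOOT, nr_queues=4)

# Mapping built from every possible CPU: hot-added CPUs are already covered.
all_possible_map = map_cpus_to_queues(POSSIBLE, nr_queues=4)

if __name__ == "__main__":
    print(unmapped_cpus(present_only_map, ONLINE_AFTER_HOTPLUG))
    print(unmapped_cpus(all_possible_map, ONLINE_AFTER_HOTPLUG))
```

In this model, a mapping computed only from the CPUs present at boot leaves the hot-added CPUs 32-47 without a queue, while a mapping over all possible CPUs needs no update when they come online.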

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-power-systems/+bug/1759723/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
