** Tags added: sru-20210621

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1931325

Title:
  cfs_bandwidth01 in sched from ubuntu_ltp_stable failed on B-4.15

Status in ubuntu-kernel-tests:
  In Progress
Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Bionic:
  In Progress

Bug description:
  [Impact]
  Test case cfs_bandwidth01 in LTP sched test suite is a reproducer
  of a CFS unthrottle_cfs_rq() issue (fe61468b2cbc2b sched/fair: Fix
  enqueue_task_fair warning).

  This test triggers a warning on our 4.15 kernel:
   LTP: starting cfs_bandwidth01 (cfs_bandwidth01 -i 5)
   ------------[ cut here ]------------
   rq->tmp_alone_branch != &rq->leaf_cfs_rq_list
   WARNING: CPU: 0 PID: 0 at 
/build/linux-fYK9kF/linux-4.15.0/kernel/sched/fair.c:393 
unthrottle_cfs_rq+0x16f/0x200
   Modules linked in: input_leds joydev serio_raw mac_hid qemu_fw_cfg kvm_intel 
kvm irqbypass sch_fq_codel binfmt_misc ib_iser rdma_cm iw_cm ib_cm ib_core 
iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi nfsd auth_rpcgss nfs_acl 
lockd grace sunrpc ip_tables x_tables autofs4 btrfs zstd_compress raid10 
raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq 
libcrc32c raid1 raid0 multipath linear cirrus ttm drm_kms_helper syscopyarea 
sysfillrect sysimgblt fb_sys_fops drm psmouse virtio_blk pata_acpi floppy 
virtio_net i2c_piix4
   CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.15.0-144-generic #148-Ubuntu
   Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 
04/01/2014
   RIP: 0010:unthrottle_cfs_rq+0x16f/0x200
   RSP: 0018:ffff989ebfc03e80 EFLAGS: 00010082
   RAX: 0000000000000000 RBX: ffff989eb4c6ac00 RCX: 0000000000000000
   RDX: 0000000000000005 RSI: ffffffffacb63c4d RDI: 0000000000000046
   RBP: ffff989ebfc03ea8 R08: 000000af39e61b33 R09: ffffffffacb63c20
   R10: 0000000000000000 R11: 0000000000000001 R12: ffff989eb57fe400
   R13: ffff989ebfc21900 R14: 0000000000000001 R15: 0000000000000001
   FS: 0000000000000000(0000) GS:ffff989ebfc00000(0000) knlGS:0000000000000000
   CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
   CR2: 000055593258d618 CR3: 000000007a044000 CR4: 00000000000006f0
   DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
   DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
   Call Trace:
   <IRQ>
   distribute_cfs_runtime+0xc3/0x110
   sched_cfs_period_timer+0xff/0x220
   ? sched_cfs_slack_timer+0xd0/0xd0
   __hrtimer_run_queues+0xdf/0x230
   hrtimer_interrupt+0xa0/0x1d0
   smp_apic_timer_interrupt+0x6f/0x140
   apic_timer_interrupt+0x90/0xa0
   </IRQ>
   RIP: 0010:native_safe_halt+0x12/0x20
   RSP: 0018:ffffffffac603e28 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff11
   RAX: ffffffffabbc9280 RBX: 0000000000000000 RCX: 0000000000000000
   RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
   RBP: ffffffffac603e28 R08: 000000af39850067 R09: ffff989e73749d00
   R10: 0000000000000000 R11: 7fffffffffffffff R12: 0000000000000000
   R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
   ? __sched_text_end+0x1/0x1
   default_idle+0x20/0x100
   arch_cpu_idle+0x15/0x20
   default_idle_call+0x23/0x30
   do_idle+0x172/0x1f0
   cpu_startup_entry+0x73/0x80
   rest_init+0xae/0xb0
   start_kernel+0x4dc/0x500
   x86_64_start_reservations+0x24/0x26
   x86_64_start_kernel+0x74/0x77
   secondary_startup_64+0xa5/0xb0
   Code: 50 09 00 00 49 39 85 60 09 00 00 74 68 80 3d 3a 6e 54 01 00 75 5f 31 
db 48 c7 c7 c0 3d 2d ac c6 05 28 6e 54 01 01 e8 11 36 fc ff <0f> 0b 48 85 db 74 
43 49 8b 85 78 09 00 00 49 39 85 70 09 00 00
   ---[ end trace b6b9a70bc2945c0c ]---

  [Fix]
  Base on the test case description, we will need these fixes:
    * fe61468b2cbc2b sched/fair: Fix enqueue_task_fair warning
    * b34cb07dde7c23 sched/fair: Fix enqueue_task_fair() warning some more
    * 39f23ce07b9355 sched/fair: Fix unthrottle_cfs_rq() for leaf_cfs_rq list
    * 6d4d22468dae3d sched/fair: Reorder enqueue/dequeue_task_fair path
    * 5ab297bab98431 sched/fair: Fix reordering of enqueue/dequeue_task_fair()

  Backport needed for Bionic since we're missing some new variables /
  coding style changes introduced in the following commits (and their
  corresponding fixes):
    * 97fb7a0a8944bd sched: Clean up and harmonize the coding style of the 
scheduler code base
    * 9f68395333ad7f sched/pelt: Add a new runnable average signal
    * 6212437f0f6043 sched/fair: Fix runnable_avg for throttled cfs
    * 43e9f7f231e40e sched/fair: Start tracking SCHED_IDLE tasks count in cfs_rq

  I have also searched in the upstream tree to see if there is any other
  commit claim to be a fix of these but didn't see any.

  [Test]
  Test kernel can be found here:
  https://people.canonical.com/~phlin/kernel/lp-1931325-cfs_bandwidth01/

  With these patches applied, the test can pass without triggering this
  warning.

  <<<test_start>>>
  tag=cfs_bandwidth01 stime=1624260713
  cmdline="cfs_bandwidth01 -i 5"
  contacts=""
  analysis=exit
  <<<test_output>>>
  incrementing stop
  tst_test.c:1313: TINFO: Timeout per run is 0h 05m 00s
  tst_buffers.c:55: TINFO: Test is using guarded buffers
  cfs_bandwidth01.c:49: TINFO: Set 'worker1/cpu.max' = '3000 10000'
  cfs_bandwidth01.c:49: TINFO: Set 'worker2/cpu.max' = '2000 10000'
  cfs_bandwidth01.c:49: TINFO: Set 'worker3/cpu.max' = '3000 10000'
  cfs_bandwidth01.c:113: TPASS: Scheduled bandwidth constrained workers
  cfs_bandwidth01.c:49: TINFO: Set 'level2/cpu.max' = '5000 10000'
  cfs_bandwidth01.c:125: TPASS: Workers exited
  cfs_bandwidth01.c:113: TPASS: Scheduled bandwidth constrained workers
  cfs_bandwidth01.c:49: TINFO: Set 'level2/cpu.max' = '5000 10000'
  cfs_bandwidth01.c:125: TPASS: Workers exited
  cfs_bandwidth01.c:113: TPASS: Scheduled bandwidth constrained workers
  cfs_bandwidth01.c:49: TINFO: Set 'level2/cpu.max' = '5000 10000'
  cfs_bandwidth01.c:125: TPASS: Workers exited
  cfs_bandwidth01.c:113: TPASS: Scheduled bandwidth constrained workers
  cfs_bandwidth01.c:49: TINFO: Set 'level2/cpu.max' = '5000 10000'
  cfs_bandwidth01.c:125: TPASS: Workers exited
  cfs_bandwidth01.c:113: TPASS: Scheduled bandwidth constrained workers
  cfs_bandwidth01.c:49: TINFO: Set 'level2/cpu.max' = '5000 10000'
  cfs_bandwidth01.c:125: TPASS: Workers exited

  Summary:
  passed 10
  failed 0
  broken 0
  skipped 0
  warnings 0

  I have also run the whole sched test suite in LTP to make sure there
  is no other issues caused by this patchset.

  [Where problems could occur]
  * CFS (Completely Fair Scheduler) is the process scheduling system in
  the kernel, if the patch is incorrect it might affect the sched
  functionality. Especially system with CONFIG_FAIR_GROUP_SCHED and
  CONFIG_CFS_BANDWIDTH enabled.

  [Other Info]
  Test case description:
   * Creates a multi-level CGroup hierarchy with the cpu controller
   * enabled. The leaf groups are populated with "busy" processes which
   * simulate intermittent cpu load. They spin for some time then sleep
   * then repeat.
   *
   * Both the trunk and leaf groups are set cpu bandwidth limits. The
   * busy processes will intermittently exceed these limits. Causing
   * them to be throttled. When they begin sleeping this will then cause
   * them to be unthrottle.

  [Original Bug Report]
  Issue found on Azure 4.15.0-1116.129 Bionic

  This is a new test case added 8 days ago. So this is not a regression:
  
https://github.com/linux-test-project/ltp/commit/d7b2d7cea71feaea1f27cae185abfe39f238d649

  Test log:

   cfs_bandwidth01.c:49: TINFO: Set 'worker1/cpu.max' = '3000 10000'
   cfs_bandwidth01.c:49: TINFO: Set 'worker2/cpu.max' = '2000 10000'
   cfs_bandwidth01.c:49: TINFO: Set 'worker3/cpu.max' = '3000 10000'
   cfs_bandwidth01.c:113: TPASS: Scheduled bandwidth constrained workers
   cfs_bandwidth01.c:49: TINFO: Set 'level2/cpu.max' = '5000 10000'
   cfs_bandwidth01.c:125: TPASS: Workers exited
   cfs_bandwidth01.c:113: TPASS: Scheduled bandwidth constrained workers
   cfs_bandwidth01.c:49: TINFO: Set 'level2/cpu.max' = '5000 10000'
   cfs_bandwidth01.c:125: TPASS: Workers exited
   cfs_bandwidth01.c:113: TPASS: Scheduled bandwidth constrained workers
   cfs_bandwidth01.c:49: TINFO: Set 'level2/cpu.max' = '5000 10000'
   cfs_bandwidth01.c:125: TPASS: Workers exited
   cfs_bandwidth01.c:113: TPASS: Scheduled bandwidth constrained workers
   cfs_bandwidth01.c:49: TINFO: Set 'level2/cpu.max' = '5000 10000'
   cfs_bandwidth01.c:125: TPASS: Workers exited
   cfs_bandwidth01.c:113: TPASS: Scheduled bandwidth constrained workers
   cfs_bandwidth01.c:49: TINFO: Set 'level2/cpu.max' = '5000 10000'
   cfs_bandwidth01.c:125: TPASS: Workers exited
   tst_test.c:1349: TFAIL: Kernel is now tainted.

   HINT: You _MAY_ be missing kernel fixes, see:

   
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=39f23ce07b93
   
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b34cb07dde7c
   
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=fe61468b2cbc
   
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5ab297bab984
   
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6d4d22468dae

   Summary:
   passed 10
   failed 1
   broken 0
   skipped 0
   warnings 0
   tag=cfs_bandwidth01 stime=1622734092 dur=15 exit=exited stat=1 core=no 
cu=478 cs=303

  Note that this test has failed on the following instances:
    * Basic_A2
    * Standard_B1ms
    * Standard_D48_v3
    * Standard_F32s_v2

  But passed with:
    * Standard_DS15_v2
    * Standard_DS5_v2

  When this issue happens, kernel will be tainted with warning:
  [  694.761774] LTP: starting cfs_bandwidth01 (cfs_bandwidth01 -i 5)
  [  695.801637] ------------[ cut here ]------------
  [  695.801640] rq->tmp_alone_branch != &rq->leaf_cfs_rq_list
  [  695.801694] WARNING: CPU: 0 PID: 0 at 
/build/linux-fYK9kF/linux-4.15.0/kernel/sched/fair.c:393 
unthrottle_cfs_rq+0x16f/0x200
  [  695.801695] Modules linked in: input_leds joydev serio_raw mac_hid 
qemu_fw_cfg kvm_intel kvm irqbypass sch_fq_codel binfmt_misc ib_iser rdma_cm 
iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi nfsd 
auth_rpcgss nfs_acl lockd grace sunrpc ip_tables x_tables autofs4 btrfs 
zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor 
async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear cirrus ttm 
drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm psmouse 
virtio_blk pata_acpi floppy virtio_net i2c_piix4
  [  695.801726] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.15.0-144-generic 
#148-Ubuntu
  [  695.801727] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.10.2-1ubuntu1 04/01/2014
  [  695.801729] RIP: 0010:unthrottle_cfs_rq+0x16f/0x200
  [  695.801730] RSP: 0018:ffff989ebfc03e80 EFLAGS: 00010082
  [  695.801732] RAX: 0000000000000000 RBX: ffff989eb4c6ac00 RCX: 
0000000000000000
  [  695.801732] RDX: 0000000000000005 RSI: ffffffffacb63c4d RDI: 
0000000000000046
  [  695.801734] RBP: ffff989ebfc03ea8 R08: 000000af39e61b33 R09: 
ffffffffacb63c20
  [  695.801734] R10: 0000000000000000 R11: 0000000000000001 R12: 
ffff989eb57fe400
  [  695.801735] R13: ffff989ebfc21900 R14: 0000000000000001 R15: 
0000000000000001
  [  695.801737] FS:  0000000000000000(0000) GS:ffff989ebfc00000(0000) 
knlGS:0000000000000000
  [  695.801738] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [  695.801739] CR2: 000055593258d618 CR3: 000000007a044000 CR4: 
00000000000006f0
  [  695.801742] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
0000000000000000
  [  695.801743] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 
0000000000000400
  [  695.801744] Call Trace:
  [  695.801755]  <IRQ>
  [  695.801765]  distribute_cfs_runtime+0xc3/0x110
  [  695.801767]  sched_cfs_period_timer+0xff/0x220
  [  695.801768]  ? sched_cfs_slack_timer+0xd0/0xd0
  [  695.801775]  __hrtimer_run_queues+0xdf/0x230
  [  695.801777]  hrtimer_interrupt+0xa0/0x1d0
  [  695.801786]  smp_apic_timer_interrupt+0x6f/0x140
  [  695.801789]  apic_timer_interrupt+0x90/0xa0
  [  695.801789]  </IRQ>
  [  695.801791] RIP: 0010:native_safe_halt+0x12/0x20
  [  695.801792] RSP: 0018:ffffffffac603e28 EFLAGS: 00000246 ORIG_RAX: 
ffffffffffffff11
  [  695.801793] RAX: ffffffffabbc9280 RBX: 0000000000000000 RCX: 
0000000000000000
  [  695.801794] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 
0000000000000000
  [  695.801795] RBP: ffffffffac603e28 R08: 000000af39850067 R09: 
ffff989e73749d00
  [  695.801796] R10: 0000000000000000 R11: 7fffffffffffffff R12: 
0000000000000000
  [  695.801796] R13: 0000000000000000 R14: 0000000000000000 R15: 
0000000000000000
  [  695.801801]  ? __sched_text_end+0x1/0x1
  [  695.801804]  default_idle+0x20/0x100
  [  695.801813]  arch_cpu_idle+0x15/0x20
  [  695.801814]  default_idle_call+0x23/0x30
  [  695.801821]  do_idle+0x172/0x1f0
  [  695.801823]  cpu_startup_entry+0x73/0x80
  [  695.801825]  rest_init+0xae/0xb0
  [  695.801843]  start_kernel+0x4dc/0x500
  [  695.801845]  x86_64_start_reservations+0x24/0x26
  [  695.801847]  x86_64_start_kernel+0x74/0x77
  [  695.801851]  secondary_startup_64+0xa5/0xb0
  [  695.801852] Code: 50 09 00 00 49 39 85 60 09 00 00 74 68 80 3d 3a 6e 54 01 
00 75 5f 31 db 48 c7 c7 c0 3d 2d ac c6 05 28 6e 54 01 01 e8 11 36 fc ff <0f> 0b 
48 85 db 74 43 49 8b 85 78 09 00 00 49 39 85 70 09 00 00
  [  695.801875] ---[ end trace b6b9a70bc2945c0c ]---

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-kernel-tests/+bug/1931325/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp

Reply via email to