Without the patch I can reproduce the hang fairly reliably, within one or two loops, and it fails in this way:
[ 1069.711956] bcache: cancel_writeback_rate_update_dwork() give up waiting for dc->writeback_write_update to quit
[ 1088.583986] INFO: task kworker/0:2:436 blocked for more than 120 seconds.
[ 1088.590330]       Tainted: P O 4.15.0-54-generic #58-Ubuntu
[ 1088.595831] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1088.598210] kworker/0:2     D    0   436      2 0x80000000
[ 1088.598244] Workqueue: events update_writeback_rate [bcache]
[ 1088.598246] Call Trace:
[ 1088.598255]  __schedule+0x291/0x8a0
[ 1088.598258]  ? __switch_to_asm+0x40/0x70
[ 1088.598260]  schedule+0x2c/0x80
[ 1088.598262]  schedule_preempt_disabled+0xe/0x10
[ 1088.598264]  __mutex_lock.isra.2+0x18c/0x4d0
[ 1088.598266]  ? __switch_to_asm+0x34/0x70
[ 1088.598268]  ? __switch_to_asm+0x34/0x70
[ 1088.598270]  ? __switch_to_asm+0x40/0x70
[ 1088.598272]  __mutex_lock_slowpath+0x13/0x20
[ 1088.598274]  ? __mutex_lock_slowpath+0x13/0x20
[ 1088.598276]  mutex_lock+0x2f/0x40
[ 1088.598281]  update_writeback_rate+0x98/0x2b0 [bcache]
[ 1088.598285]  process_one_work+0x1de/0x410
[ 1088.598287]  worker_thread+0x32/0x410
[ 1088.598289]  kthread+0x121/0x140
[ 1088.598291]  ? process_one_work+0x410/0x410
[ 1088.598293]  ? kthread_create_worker_on_cpu+0x70/0x70
[ 1088.598295]  ret_from_fork+0x35/0x40

https://bugs.launchpad.net/bugs/1796292

Title:
  Tight timeout for bcache removal causes spurious failures

Status in curtin:
  Fix Released
Status in linux package in Ubuntu:
  Confirmed
Status in linux source package in Bionic:
  New
Status in linux source package in Cosmic:
  New
Status in linux source package in Disco:
  New
Status in linux source package in Eoan:
  Confirmed

Bug description:
  I've had a number of deployment faults where curtin would report
  "Timeout exceeded for removal of /sys/fs/bcache/xxx" when doing a
  mass-deployment of 30+ nodes. Upon retrying, the node would usually
  deploy fine.
  Experimentally I've set the timeout ridiculously high, and it seems
  I'm getting no faults with this. I'm wondering if the timeout for
  removal is set too tight, or might need to be made configurable.

  --- curtin/util.py~	2018-05-18 18:40:48.000000000 +0000
  +++ curtin/util.py	2018-10-05 09:40:06.807390367 +0000
  @@ -263,7 +263,7 @@
       return _subp(*args, **kwargs)

  -def wait_for_removal(path, retries=[1, 3, 5, 7]):
  +def wait_for_removal(path, retries=[1, 3, 5, 7, 1200, 1200]):
       if not path:
           raise ValueError('wait_for_removal: missing path parameter')
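For context, the signature in the diff suggests wait_for_removal polls the path and sleeps for each interval in the retries schedule before giving up. The following is a minimal sketch of that behaviour, reconstructed from the diff hunk above; the real curtin implementation may differ in details such as logging and the exact exception raised:

```python
import os
import time


def wait_for_removal(path, retries=(1, 3, 5, 7)):
    """Wait for `path` to disappear from the filesystem.

    Sleeps for each interval in `retries` between existence checks,
    so the stock schedule gives up after roughly 1+3+5+7 = 16 seconds.
    Raises OSError if the path still exists after the full schedule.
    """
    if not path:
        raise ValueError('wait_for_removal: missing path parameter')
    for sleep_time in retries:
        if not os.path.exists(path):
            return
        time.sleep(sleep_time)
    # One final check after the last sleep before declaring failure.
    if os.path.exists(path):
        raise OSError('Timeout exceeded for removal of %s' % path)
```

Read this way, the proposed change simply appends two 1200-second intervals to the schedule, stretching the total wait from about 16 seconds to about 40 minutes rather than making it configurable.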

