I've set up our integration test that runs the CDO-QA bcache/ceph
setup.

On the updated kernel I got through 10 loops of the deployment before it
hit a stack trace:

http://paste.ubuntu.com/p/zVrtvKBfCY/

[ 3939.846908] bcache: bch_cached_dev_attach() Caching vdd as bcache5 on set 275985b3-da58-41f8-9072-958bd960b490
[ 3939.878388] bcache: register_bcache() error /dev/vdd: device already registered (emitting change event)
[ 3939.904984] bcache: register_bcache() error /dev/vdd: device already registered (emitting change event)
[ 3939.972715] bcache: register_bcache() error /dev/vdd: device already registered (emitting change event)
[ 3940.059415] bcache: register_bcache() error /dev/vdd: device already registered (emitting change event)
[ 3940.129854] bcache: register_bcache() error /dev/vdd: device already registered (emitting change event)
[ 3949.257051] md: md0: resync done.
[ 4109.273558] INFO: task python3:19635 blocked for more than 120 seconds.
[ 4109.279331]       Tainted: P           O     4.15.0-55-generic #60+lp796292
[ 4109.284767] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4109.288771] python3         D    0 19635  16361 0x00000000
[ 4109.288774] Call Trace:
[ 4109.288818]  __schedule+0x291/0x8a0
[ 4109.288822]  ? __switch_to_asm+0x34/0x70
[ 4109.288824]  ? __switch_to_asm+0x40/0x70
[ 4109.288826]  schedule+0x2c/0x80
[ 4109.288852]  bch_bucket_alloc+0x1fa/0x350 [bcache]
[ 4109.288866]  ? wait_woken+0x80/0x80
[ 4109.288872]  __bch_bucket_alloc_set+0xfe/0x150 [bcache]
[ 4109.288876]  bch_bucket_alloc_set+0x4e/0x70 [bcache]
[ 4109.288882]  __uuid_write+0x59/0x150 [bcache]
[ 4109.288895]  ? submit_bio+0x73/0x140
[ 4109.288900]  ? __write_super+0x137/0x170 [bcache]
[ 4109.288905]  bch_uuid_write+0x16/0x40 [bcache]
[ 4109.288911]  __cached_dev_store+0x1a1/0x6d0 [bcache]
[ 4109.288916]  bch_cached_dev_store+0x39/0xc0 [bcache]
[ 4109.288992]  sysfs_kf_write+0x3c/0x50
[ 4109.288998]  kernfs_fop_write+0x125/0x1a0
[ 4109.289001]  __vfs_write+0x1b/0x40
[ 4109.289003]  vfs_write+0xb1/0x1a0
[ 4109.289004]  SyS_write+0x55/0xc0
[ 4109.289010]  do_syscall_64+0x73/0x130
[ 4109.289014]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 4109.289016] RIP: 0033:0x7f8d2833e154
[ 4109.289018] RSP: 002b:00007ffcda55a4e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 4109.289020] RAX: ffffffffffffffda RBX: 0000000000000008 RCX: 00007f8d2833e154
[ 4109.289021] RDX: 0000000000000008 RSI: 00000000022b7360 RDI: 0000000000000003
[ 4109.289022] RBP: 00007f8d288396c0 R08: 0000000000000000 R09: 0000000000000000
[ 4109.289022] R10: 0000000000000100 R11: 0000000000000246 R12: 0000000000000003
[ 4109.289026] R13: 0000000000000000 R14: 00000000022b7360 R15: 0000000001fe8db0
[ 4109.289033] INFO: task bcache_allocato:22317 blocked for more than 120 seconds.
[ 4109.292172]       Tainted: P           O     4.15.0-55-generic #60+lp796292
[ 4109.295345] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4109.298208] bcache_allocato D    0 22317      2 0x80000000
[ 4109.298212] Call Trace:
[ 4109.298217]  __schedule+0x291/0x8a0
[ 4109.298225]  schedule+0x2c/0x80
[ 4109.298232]  bch_bucket_alloc+0x1fa/0x350 [bcache]
[ 4109.298235]  ? wait_woken+0x80/0x80
[ 4109.298241]  bch_prio_write+0x19f/0x340 [bcache]
[ 4109.298246]  bch_allocator_thread+0x502/0xca0 [bcache]
[ 4109.298260]  kthread+0x121/0x140
[ 4109.298264]  ? bch_invalidate_one_bucket+0x80/0x80 [bcache]
[ 4109.298274]  ? kthread_create_worker_on_cpu+0x70/0x70
[ 4109.298277]  ret_from_fork+0x35/0x40

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1796292

Title:
  Tight timeout for bcache removal causes spurious failures

Status in curtin:
  Fix Released
Status in linux package in Ubuntu:
  Confirmed
Status in linux source package in Bionic:
  New
Status in linux source package in Cosmic:
  New
Status in linux source package in Disco:
  New
Status in linux source package in Eoan:
  Confirmed

Bug description:
  I've had a number of deployment faults where curtin would report
  "Timeout exceeded for removal of /sys/fs/bcache/xxx" when doing a mass
  deployment of 30+ nodes. Upon retrying, the node would usually deploy
  fine. Experimentally, I set the timeout ridiculously high, and with
  that I see no faults at all. I'm wondering whether the timeout for
  removal is set too tight, or whether it should be made configurable.

  --- curtin/util.py~     2018-05-18 18:40:48.000000000 +0000
  +++ curtin/util.py      2018-10-05 09:40:06.807390367 +0000
  @@ -263,7 +263,7 @@
       return _subp(*args, **kwargs)
   
   
  -def wait_for_removal(path, retries=[1, 3, 5, 7]):
  +def wait_for_removal(path, retries=[1, 3, 5, 7, 1200, 1200]):
       if not path:
           raise ValueError('wait_for_removal: missing path parameter')
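  For reference, a minimal standalone sketch of the retry loop the patch
  above is tuning, with the schedule exposed as a parameter. The function
  name and the ValueError check come from curtin's util.py as quoted;
  the rest of the body is an assumption for illustration, not curtin's
  actual implementation:

```python
import os
import time


def wait_for_removal(path, retries=(1, 3, 5, 7)):
    """Poll until a sysfs path disappears, sleeping per the retry schedule.

    Sketch modeled on curtin's util.wait_for_removal (body assumed).
    Raises OSError if the path still exists after all retries.
    """
    if not path:
        raise ValueError('wait_for_removal: missing path parameter')
    for wait in retries:
        if not os.path.exists(path):
            return
        time.sleep(wait)
    if os.path.exists(path):
        raise OSError('timed out waiting for removal of %s' % path)
```

  With a schedule like (1, 3, 5, 7, 1200, 1200), the caller tolerates the
  long stalls seen in the hung-task traces above at the cost of waiting up
  to ~40 minutes before declaring failure.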

To manage notifications about this bug go to:
https://bugs.launchpad.net/curtin/+bug/1796292/+subscriptions
