Public bug reported:
We create and deploy custom AMIs based on Ubuntu Jammy, and since
jammy-20230428 we have noticed that instances built from these AMIs
randomly fail during the boot process. Destroying the instance and
deploying it again works around the problem. The stack trace is always
the same:
```
[ 849.765218] INFO: task swapper/0:1 blocked for more than 727 seconds.
[ 849.774999] Not tainted 5.19.0-1025-aws #26~22.04.1-Ubuntu
[ 849.787081] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 849.811223] task:swapper/0 state:D stack: 0 pid: 1 ppid: 0 flags:0x00004000
[ 849.883494] Call Trace:
[ 849.891369] <TASK>
[ 849.899306] __schedule+0x254/0x5a0
[ 849.907878] schedule+0x5d/0x100
[ 849.917136] io_schedule+0x46/0x80
[ 849.970890] blk_mq_get_tag+0x117/0x300
[ 849.976136] ? destroy_sched_domains_rcu+0x40/0x40
[ 849.981442] __blk_mq_alloc_requests+0xc4/0x1e0
[ 849.986750] blk_mq_get_new_requests+0xcc/0x190
[ 849.992185] blk_mq_submit_bio+0x1eb/0x450
[ 850.070689] __submit_bio+0xf6/0x190
[ 850.075545] submit_bio_noacct_nocheck+0xc2/0x120
[ 850.080841] submit_bio_noacct+0x209/0x560
[ 850.085654] submit_bio+0x40/0xf0
[ 850.090361] submit_bh_wbc+0x134/0x170
[ 850.094905] ll_rw_block+0xbc/0xd0
[ 850.175198] do_readahead.isra.0+0x126/0x1e0
[ 850.183531] jread+0xeb/0x100
[ 850.189648] do_one_pass+0xbb/0xb90
[ 850.193917] ? crypto_create_tfm_node+0x9a/0x120
[ 850.207511] ? crc_43+0x1e/0x1e
[ 850.211887] jbd2_journal_recover+0x8d/0x150
[ 850.272927] jbd2_journal_load+0x130/0x1f0
[ 850.280601] ext4_load_journal+0x271/0x5d0
[ 850.288540] __ext4_fill_super+0x2aa1/0x2e10
[ 850.296290] ? pointer+0x36f/0x500
[ 850.304910] ext4_fill_super+0xd3/0x280
[ 850.372470] ? ext4_fill_super+0xd3/0x280
[ 850.380637] get_tree_bdev+0x189/0x280
[ 850.384398] ? __ext4_fill_super+0x2e10/0x2e10
[ 850.388490] ext4_get_tree+0x15/0x20
[ 850.392123] vfs_get_tree+0x2a/0xd0
[ 850.395859] do_new_mount+0x184/0x2e0
[ 850.468151] path_mount+0x1f3/0x890
[ 850.471804] ? putname+0x5f/0x80
[ 850.475341] init_mount+0x5e/0x9f
[ 850.478976] do_mount_root+0x8d/0x124
[ 850.482626] mount_block_root+0xd8/0x1ea
[ 850.486368] mount_root+0x62/0x6e
[ 850.568079] prepare_namespace+0x13f/0x19e
[ 850.571984] kernel_init_freeable+0x120/0x139
[ 850.575930] ? rest_init+0xe0/0xe0
[ 850.579511] kernel_init+0x1b/0x170
[ 850.583084] ? rest_init+0xe0/0xe0
[ 850.586642] ret_from_fork+0x22/0x30
[ 850.668205] </TASK>
```
This happens since 5.19.0-1024-aws; I have now rolled back to 5.19.0-1022-aws.
I can easily reproduce it in my dev environment, where 15 EC2 instances
are deployed at once: 1 or 2 of them randomly fail, and the next
deployment succeeds.
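For reference, a minimal sketch of how stuck instances in such a batch
can be spotted, assuming the AWS CLI v2 (the tag value and instance ID
below are hypothetical):
```
# List the instance IDs of the batch (hypothetical tag value "dev-batch").
ids=$(aws ec2 describe-instances \
        --filters "Name=tag:Name,Values=dev-batch" \
        --query 'Reservations[].Instances[].InstanceId' --output text)

# An instance hung during boot usually fails its reachability status check.
aws ec2 describe-instance-status --instance-ids $ids \
    --query 'InstanceStatuses[].[InstanceId,InstanceStatus.Status]' \
    --output table

# Capture the serial console output of a stuck instance (Nitro instances
# support --latest; the instance ID is a placeholder).
aws ec2 get-console-output --instance-id i-0123456789abcdef0 \
    --latest --query Output --output text
```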
I cannot debug further because I have no connection to the affected
instance; I was able to copy the trace above thanks to the virtual
serial console. If there are boot options I could add to gather more
information, I would be happy to do so.
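As a sketch, these standard kernel parameters might surface more detail
on the serial console; this assumes a GRUB-based Ubuntu AMI and is not a
confirmed recipe for this particular bug:
```
# Extra kernel command-line options for debugging a boot hang:
#   console=ttyS0,115200              keep kernel output on the serial port
#   loglevel=7                        print all kernel messages, incl. debug
#   initcall_debug                    log every initcall to localize boot hangs
#   sysctl.kernel.hung_task_panic=1   panic on the first hung task (full dump)
#
# Append them in /etc/default/grub, e.g.:
# GRUB_CMDLINE_LINUX_DEFAULT="console=tty1 console=ttyS0,115200 loglevel=7 initcall_debug sysctl.kernel.hung_task_panic=1"
sudo update-grub   # regenerate grub.cfg, then bake a new AMI from the instance
```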
** Affects: linux (Ubuntu)
Importance: Undecided
Status: Incomplete
** Tags: aws ebs jammy ubuntu
--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2022329
Title:
  EBS volume attachment during boot randomly causes EC2 instances to get
  stuck
Status in linux package in Ubuntu:
Incomplete