Another bcachefs version downgrade bug

Carl E. Thompson Tue, 15 Oct 2024 21:58:41 -0700

Hi,

     I believe there is another, newer version downgrade bug in bcachefs 
(kernels tested: 6.9.4 <--> 6.11.3).


     My laptop normally runs kernel 6.9.4 with 4 bcachefs filesystems mounted
from LVM2 logical volumes, including the root filesystem. I needed to test
something under 6.11, so I booted kernel 6.11.3 and used the system normally
from the console (bcachefs worked fine under 6.11.3). When I then tried to boot
back into 6.9.4, the laptop no longer started: it hangs when trying to mount
and manipulate the root filesystem. The kernel log shows hung-task traces for
the copygc thread (see dmesg output below). This now happens every time I try
to start 6.9.4. According to the kernel log, the bcachefs filesystem seems to
complete the version downgrade and initial mount successfully, but it starts
hanging as soon as the filesystem is used. Booting back into the 6.11.3 kernel
makes the filesystems work again, but I can't run 6.11 on my laptop normally
because 6.11 (and 6.10) have amdgpu issues that cause unrecoverable graphical
desktop lockups. So right now I have to choose between filesystems that don't
work and periodic hard graphical desktop crashes, neither of which is ideal.

     On my laptop and some of my other computers I boot multiple Linux
distributions, which usually run different kernels, and mount the same
filesystems on all of them (except root). So I do need to be able to switch
back and forth between kernels as needed on all of my systems, and issues like
this give me some pause. I will disable bcachefs on my dev systems and servers
for now, until I am more confident that there is a solid testing plan in place
to prevent this kind of issue when booting multiple kernels. I will keep
bcachefs on my laptop for testing. A fix for my laptop isn't urgent for me
personally, as I can recreate the filesystems under 6.9.4 and restore from
backups. Of course, other people might need a fix more quickly. Next time I
need to boot a different kernel I'll make sure to create LVM snapshots of the
devices first, so that I can revert if needed.
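
A minimal sketch of that snapshot step, for anyone in a similar multi-kernel
setup. It only prints the LVM commands rather than running them. The volume
group and logical volume names (clip, root-alpine) are taken from the
show-super invocation below; the helper names, the 2G snapshot size and the
"-pre-upgrade" suffix are assumptions made for illustration, not something I
have tested on this system.

```shell
#!/bin/sh
# Hypothetical helpers: print (do not execute) the LVM commands for taking
# and reverting a snapshot of each logical volume before booting another kernel.
# SNAP_SIZE and the "-pre-upgrade" suffix are illustrative assumptions.
SNAP_SIZE="2G"

# snapshot_cmds VG LV...  -> one lvcreate command per logical volume
snapshot_cmds() {
    vg="$1"
    shift
    for lv in "$@"; do
        printf 'lvcreate --snapshot --size %s --name %s-pre-upgrade %s/%s\n' \
            "$SNAP_SIZE" "$lv" "$vg" "$lv"
    done
}

# revert_cmds VG LV...  -> one lvconvert command per logical volume,
# merging the snapshot back into its origin volume
revert_cmds() {
    vg="$1"
    shift
    for lv in "$@"; do
        printf 'lvconvert --merge %s/%s-pre-upgrade\n' "$vg" "$lv"
    done
}

# Dry run for the affected root volume from this report.
snapshot_cmds clip root-alpine
revert_cmds clip root-alpine
```

The printed commands would have to be reviewed and run as root. As far as I
know, merging a snapshot into an origin that is in use (such as a mounted root
filesystem) is deferred until the volume is next activated.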

Thanks,
Carl

show-super from one affected filesystem:
---
[clip carl]# bcachefs show-super /dev/clip/root-alpine 
Device:                                     (unknown device)
External UUID:                             c992a5de-c9b3-4fd1-82ed-4d2f66bc11cb
Internal UUID:                             43b4fe97-f5a4-48b3-8d99-3a3dda25211a
Magic number:                              c68573f6-66ce-90a9-d96a-60cf803df7ef
Device index:                              0
Label:                                     (none)
Version:                                   1.12: (unknown version)
Version upgrade complete:                  1.12: (unknown version)
Oldest version on disk:                    1.4: member_seq
Created:                                   Fri Mar 22 19:19:01 2024
Sequence number:                           249
Time of last write:                        Tue Oct 15 20:23:34 2024
Superblock size:                           4.45 KiB/1.00 MiB
Clean:                                     0
Devices:                                   1
Sections:                                  members_v1,replicas_v0,clean,journal_seq_blacklist,journal_v2,counters,members_v2,errors,ext,downgrade
Features:                                  lz4,journal_seq_blacklist_v3,reflink,new_siphash,inline_data,new_extent_overwrite,btree_ptr_v2,extents_above_btree_updates,btree_updates_journalled,reflink_inline_data,new_varint,journal_no_flush,alloc_v2,extents_across_btree_nodes
Compat features:                           alloc_info,alloc_metadata,extents_above_btree_updates_done,bformat_overflow_done

Options:
  block_size:                              4.00 KiB
  btree_node_size:                         256 KiB
  errors:                                  continue [fix_safe] panic ro 
  metadata_replicas:                       1
  data_replicas:                           1
  metadata_replicas_required:              1
  data_replicas_required:                  1
  encoded_extent_max:                      64.0 KiB
  metadata_checksum:                       none [crc32c] crc64 xxhash 
  data_checksum:                           none [crc32c] crc64 xxhash 
  compression:                             lz4
  background_compression:                  none
  str_hash:                                crc32c crc64 [siphash] 
  metadata_target:                         none
  foreground_target:                       none
  background_target:                       none
  promote_target:                          none
  erasure_code:                            0
  inodes_32bit:                            1
  shard_inode_numbers:                     1
  inodes_use_key_cache:                    1
  gc_reserve_percent:                      8
  gc_reserve_bytes:                        0 B
  root_reserve_percent:                    0
  wide_macs:                               0
  promote_whole_extents:                   0
  acl:                                     1
  usrquota:                                0
  grpquota:                                0
  prjquota:                                0
  journal_flush_delay:                     1000
  journal_flush_disabled:                  0
  journal_reclaim_delay:                   100
  journal_transaction_names:               1
  allocator_stuck_timeout:                 30
  version_upgrade:                         [compatible] incompatible none 
  nocow:                                   0

members_v2 (size 160):
Device:                                    0
  Label:                                   (none)
  UUID:                                    352e33b9-dde4-48da-8fe2-255ae78c6320
  Size:                                    24.0 GiB
  read errors:                             0
  write errors:                            0
  checksum errors:                         2
  seqread iops:                            0
  seqwrite iops:                           0
  randread iops:                           0
  randwrite iops:                          0
  Bucket size:                             256 KiB
  First bucket:                            0
  Buckets:                                 98304
  Last mount:                              Tue Oct 15 20:23:32 2024
  Last superblock write:                   249
  State:                                   rw
  Data allowed:                            journal,btree,user
  Has data:                                journal,btree,user
  Btree allocated bitmap blocksize:        1.00 MiB
  Btree allocated bitmap:                  0000000000000000000000000000011111111111111111111111111111111111
  Durability:                              1
  Discard:                                 1
  Freespace initialized:                   1

errors (size 24):
bset_bad_csum                               1               Sat Jul  6 07:43:37 2024


dmesg output:
---

...

[  230.456893] bcachefs (dm-7): mounting version 1.12: (unknown version) opts=compression=lz4
[  230.456911] bcachefs (dm-7): recovering from clean shutdown, journal seq 4901
[  230.456915] bcachefs (dm-7): Version downgrade required:
[  230.469098] bcachefs (dm-7): alloc_read... done
[  230.469111] bcachefs (dm-7): stripes_read... done
[  230.469115] bcachefs (dm-7): snapshots_read... done
[  230.469436] bcachefs (dm-7): journal_replay... done
[  230.469441] bcachefs (dm-7): resume_logged_ops... done
[  230.469450] bcachefs (dm-7): going read-write
[  368.351326] INFO: task bch-copygc/dm-7:547 blocked for more than 122 seconds.
[  368.351336]       Not tainted 6.9.4-arch1-1 #1
[  368.351338] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  368.351340] task:bch-copygc/dm-7 state:D stack:0     pid:547   tgid:547   ppid:2      flags:0x00004000
[  368.351345] Call Trace:
[  368.351348]  <TASK>
[  368.351354]  __schedule+0x3c7/0x1510
[  368.351368]  schedule+0x27/0xf0
[  368.351372]  __closure_sync+0x7e/0x140
[  368.351382]  __bch2_write+0x136b/0x1660 [bcachefs 8d6f6bf430dcfbb124cd4a016333997e24e1fc8a]
[  368.351436]  ? srso_alias_return_thunk+0x5/0xfbef5
[  368.351440]  ? srso_alias_return_thunk+0x5/0xfbef5
[  368.351441]  ? __kmalloc+0x1a7/0x440
[  368.351446]  ? srso_alias_return_thunk+0x5/0xfbef5
[  368.351448]  ? srso_alias_return_thunk+0x5/0xfbef5
[  368.351452]  ? srso_alias_return_thunk+0x5/0xfbef5
[  368.351454]  ? local_clock_noinstr+0xd/0xd0
[  368.351456]  ? srso_alias_return_thunk+0x5/0xfbef5
[  368.351457]  ? srso_alias_return_thunk+0x5/0xfbef5
[  368.351460]  ? bch2_moving_ctxt_do_pending_writes+0x11a/0x220 [bcachefs 8d6f6bf430dcfbb124cd4a016333997e24e1fc8a]
[  368.351489]  bch2_moving_ctxt_do_pending_writes+0x11a/0x220 [bcachefs 8d6f6bf430dcfbb124cd4a016333997e24e1fc8a]
[  368.351511]  ? srso_alias_return_thunk+0x5/0xfbef5
[  368.351512]  ? bch2_btree_path_traverse_one+0x958/0xcf0 [bcachefs 8d6f6bf430dcfbb124cd4a016333997e24e1fc8a]
[  368.351539]  bch2_data_update_init+0x68b/0x1420 [bcachefs 8d6f6bf430dcfbb124cd4a016333997e24e1fc8a]
[  368.351573]  ? bch2_move_extent+0x3da/0xed0 [bcachefs 8d6f6bf430dcfbb124cd4a016333997e24e1fc8a]
[  368.351602]  bch2_move_extent+0x3da/0xed0 [bcachefs 8d6f6bf430dcfbb124cd4a016333997e24e1fc8a]
[  368.351631]  ? bch2_evacuate_bucket+0x9d4/0xc00 [bcachefs 8d6f6bf430dcfbb124cd4a016333997e24e1fc8a]
[  368.351652]  bch2_evacuate_bucket+0x9d4/0xc00 [bcachefs 8d6f6bf430dcfbb124cd4a016333997e24e1fc8a]
[  368.351681]  ? bch2_copygc+0x210/0x880 [bcachefs 8d6f6bf430dcfbb124cd4a016333997e24e1fc8a]
[  368.351702]  bch2_copygc+0x210/0x880 [bcachefs 8d6f6bf430dcfbb124cd4a016333997e24e1fc8a]
[  368.351732]  bch2_copygc_thread+0x152/0x3d0 [bcachefs 8d6f6bf430dcfbb124cd4a016333997e24e1fc8a]
[  368.351775]  ? bch2_copygc_thread+0xcf/0x3d0 [bcachefs 8d6f6bf430dcfbb124cd4a016333997e24e1fc8a]
[  368.351828]  ? __pfx_bch2_copygc_thread+0x10/0x10 [bcachefs 8d6f6bf430dcfbb124cd4a016333997e24e1fc8a]
[  368.351868]  kthread+0xcf/0x100
[  368.351876]  ? __pfx_kthread+0x10/0x10
[  368.351882]  ret_from_fork+0x31/0x50
[  368.351889]  ? __pfx_kthread+0x10/0x10
[  368.351894]  ret_from_fork_asm+0x1a/0x30
[  368.351905]  </TASK>
[  491.230894] INFO: task bch-copygc/dm-7:547 blocked for more than 245 seconds.
[  491.230914]       Not tainted 6.9.4-arch1-1 #1
[  491.230920] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  491.230924] task:bch-copygc/dm-7 state:D stack:0     pid:547   tgid:547   ppid:2      flags:0x00004000
[  491.230939] Call Trace:
[  491.230944]  <TASK>
[  491.230955]  __schedule+0x3c7/0x1510
[  491.230984]  schedule+0x27/0xf0
[  491.230993]  __closure_sync+0x7e/0x140
[  491.231011]  __bch2_write+0x136b/0x1660 [bcachefs 8d6f6bf430dcfbb124cd4a016333997e24e1fc8a]
[  491.231160]  ? srso_alias_return_thunk+0x5/0xfbef5
[  491.231169]  ? srso_alias_return_thunk+0x5/0xfbef5
[  491.231174]  ? __kmalloc+0x1a7/0x440
[  491.231186]  ? srso_alias_return_thunk+0x5/0xfbef5
[  491.231192]  ? srso_alias_return_thunk+0x5/0xfbef5
[  491.231206]  ? srso_alias_return_thunk+0x5/0xfbef5
[  491.231211]  ? local_clock_noinstr+0xd/0xd0
[  491.231218]  ? srso_alias_return_thunk+0x5/0xfbef5
[  491.231223]  ? srso_alias_return_thunk+0x5/0xfbef5
[  491.231232]  ? bch2_moving_ctxt_do_pending_writes+0x11a/0x220 [bcachefs 8d6f6bf430dcfbb124cd4a016333997e24e1fc8a]
[  491.231340]  bch2_moving_ctxt_do_pending_writes+0x11a/0x220 [bcachefs 8d6f6bf430dcfbb124cd4a016333997e24e1fc8a]
[  491.231412]  ? srso_alias_return_thunk+0x5/0xfbef5
[  491.231418]  ? bch2_btree_path_traverse_one+0x958/0xcf0 [bcachefs 8d6f6bf430dcfbb124cd4a016333997e24e1fc8a]
[  491.231509]  bch2_data_update_init+0x68b/0x1420 [bcachefs 8d6f6bf430dcfbb124cd4a016333997e24e1fc8a]
[  491.231625]  ? bch2_move_extent+0x3da/0xed0 [bcachefs 8d6f6bf430dcfbb124cd4a016333997e24e1fc8a]
[  491.231732]  bch2_move_extent+0x3da/0xed0 [bcachefs 8d6f6bf430dcfbb124cd4a016333997e24e1fc8a]
[  491.231823]  ? bch2_evacuate_bucket+0x9d4/0xc00 [bcachefs 8d6f6bf430dcfbb124cd4a016333997e24e1fc8a]
[  491.231883]  bch2_evacuate_bucket+0x9d4/0xc00 [bcachefs 8d6f6bf430dcfbb124cd4a016333997e24e1fc8a]
[  491.231963]  ? bch2_copygc+0x210/0x880 [bcachefs 8d6f6bf430dcfbb124cd4a016333997e24e1fc8a]
[  491.232022]  bch2_copygc+0x210/0x880 [bcachefs 8d6f6bf430dcfbb124cd4a016333997e24e1fc8a]
[  491.232089]  bch2_copygc_thread+0x152/0x3d0 [bcachefs 8d6f6bf430dcfbb124cd4a016333997e24e1fc8a]
[  491.232148]  ? bch2_copygc_thread+0xcf/0x3d0 [bcachefs 8d6f6bf430dcfbb124cd4a016333997e24e1fc8a]
[  491.232217]  ? __pfx_bch2_copygc_thread+0x10/0x10 [bcachefs 8d6f6bf430dcfbb124cd4a016333997e24e1fc8a]
[  491.232271]  kthread+0xcf/0x100
[  491.232282]  ? __pfx_kthread+0x10/0x10
[  491.232289]  ret_from_fork+0x31/0x50
[  491.232298]  ? __pfx_kthread+0x10/0x10
[  491.232304]  ret_from_fork_asm+0x1a/0x30
[  491.232319]  </TASK>

...
