It took a while but I managed to get hold of another one of these arm32 boards. Very disappointingly this exact "bitflip" is still present (log enclosed).
To summarise, as it's been a while: - When running scrub, a "page_start" and "eb_start" mismatch is detected (off by a single bit). - `btrfs check` reports no significant errors on aarch64 or arm32. - `btrfs scrub` completes successfully on aarch64! - Now, I can confirm that `btrfs scrub` fails in the same manner on two arm32 machines. Not really sure where to go from here. The only things I can think of are: - Bug in `btrfs scrub` on arm32 (seems unlikely given the single-bit error here). - A bitflip on the original machine wrote bad data to the disk, and somehow only the arm32 implementation catches this... I can give a newer kernel a try (I'm on 5.4.80 at the moment) I suppose Best, Joe ``` $ dmesg [ 258.791966] BTRFS info (device sda1): scrub: started on devid 4 [ 258.792030] BTRFS info (device sda1): scrub: started on devid 1 the magic generation gap you mention is 15, I'm more suspicious about me [ 258.792062] BTRFS info (device sda1): scrub: started on devid 2 [ 411.849669] ------------[ cut here ]------------ [ 411.849786] WARNING: CPU: 0 PID: 218 at fs/btrfs/disk-io.c:531 btree_csum_one_bio+0x22c/0x278 [btrfs] [ 411.849789] Modules linked in: cfg80211 rfkill 8021q ip6table_nat iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6t_rpfilter ipt_rpfilter ip6table_raw iptable_raw xt_pkttype nf_log_ipv6 nf_log_ipv4 nf_log_common xt_LOG xt_tcpudp ip6table_filter ip6_tables iptable_filter phy_generic uio_pdrv_genirq uio sch_fq_codel loop tun tap macvlan bridge stp llc lm75 ip_tables x_tables autofs4 dm_mod dax btrfs libcrc32c xor raid6_pq [ 411.849836] CPU: 0 PID: 218 Comm: btrfs-transacti Not tainted 5.4.80 #1-NixOS [ 411.849838] Hardware name: Marvell Armada 380/385 (Device Tree) [ 411.849840] Backtrace: [ 411.849850] [<c010f698>] (dump_backtrace) from [<c010f938>] (show_stack+0x20/0x24) [ 411.849855] r7:00000213 r6:600f0013 r5:00000000 r4:c0f8c0c4 [ 411.849861] [<c010f918>] (show_stack) from [<c0a1bb28>] (dump_stack+0x98/0xac) [ 411.849866] [<c0a1ba90>] (dump_stack) from [<c012a998>] (__warn+0xe0/0x108) [ 411.849870] r7:00000213 r6:bf058f4c r5:00000009 r4:bf120990 [ 411.849875] [<c012a8b8>] (__warn) from [<c012ad24>] (warn_slowpath_fmt+0x74/0xc4) [ 411.849878] r7:00000213 r6:bf120990 r5:00000000 r4:e1c1c000 [ 411.849936] [<c012acb4>] (warn_slowpath_fmt) from [<bf058f4c>] (btree_csum_one_bio+0x22c/0x278 [btrfs]) [ 411.849941] r9:00000001 r8:e6afec78 r7:e1c1c000 r6:ed7f7000 r5:00001000 r4:8fd00000 [ 411.850044] [<bf058d20>] (btree_csum_one_bio [btrfs]) from [<bf059f84>] (btree_submit_bio_hook+0xe8/0x100 [btrfs]) [ 411.850049] r10:e6afe4c0 r9:ecc57fc0 r8:ecc57f70 r7:ed7f7000 r6:00000000 r5:c1660370 [ 411.850051] r4:bf059e9c [ 411.850154] [<bf059e9c>] (btree_submit_bio_hook [btrfs]) from [<bf08b1ac>] (submit_one_bio+0x44/0x5c [btrfs]) [ 411.850158] r7:ef2d58ec r6:e1c1dcac r5:00000000 r4:bf059e9c [ 411.850260] [<bf08b168>] (submit_one_bio [btrfs]) from [<bf096664>] (btree_write_cache_pages+0x380/0x408 [btrfs]) [ 411.850263] r5:00000000 r4:00000000 [ 411.850366] [<bf0962e4>] (btree_write_cache_pages [btrfs]) from [<bf0590b8>] (btree_writepages+0x7c/0x84 [btrfs]) [ 411.850371] r10:00000001 r9:8fd00000 r8:c0280bcc r7:e1c1c000 r6:e1c1dd80 r5:ecc57f70 [ 411.850373] r4:e1c1dd80 [ 411.850428] [<bf05903c>] (btree_writepages [btrfs]) from [<c0284680>] (do_writepages+0x58/0xf4) [ 411.850431] r5:ecc57f70 r4:ecc57e68 [ 411.850439] [<c0284628>] (do_writepages) from [<c0278b68>] (__filemap_fdatawrite_range+0xf8/0x130) [ 411.850442] r8:ecc57f70 r7:00001000 r6:8fd0bfff r5:e1c1c000 r4:ecc57e68 [ 411.850448] [<c0278a70>] (__filemap_fdatawrite_range) from [<c0278cf0>] (filemap_fdatawrite_range+0x2c/0x34) [ 411.850452] r10:ecc57f70 r9:00001000 r8:8fd0bfff r7:e1c1de4c r6:e0202428 r5:00001000 [ 411.850454] r4:8fd0bfff [ 411.850509] [<c0278cc4>] (filemap_fdatawrite_range) from [<bf060544>] (btrfs_write_marked_extents+0x9c/0x1b0 [btrfs]) [ 411.850512] r5:00000001 r4:00000000 [ 411.850615] [<bf0604a8>] (btrfs_write_marked_extents [btrfs]) from [<bf0606f0>] (btrfs_write_and_wait_transaction+0x54/0xa4 [btrfs]) [ 411.850620] r10:e1c1c000 r9:ed7f7010 r8:ed7f7000 r7:e0202428 r6:ed7f7000 r5:e1c1c000 [ 411.850621] r4:e56433f0 [ 411.850724] [<bf06069c>] (btrfs_write_and_wait_transaction [btrfs]) from [<bf062428>] (btrfs_commit_transaction+0x75c/0xc94 [btrfs]) [ 411.850728] r8:ed7f7418 r7:e0202400 r6:ed7f7000 r5:e56433f0 r4:00000000 [ 411.850831] [<bf061ccc>] (btrfs_commit_transaction [btrfs]) from [<bf05cd98>] (transaction_kthread+0x19c/0x1e0 [btrfs]) [ 411.850836] r10:ed7f728c r9:00000000 r8:001abf6a r7:00000064 r6:ed7f7414 r5:00000bb8 [ 411.850838] r4:ed7f7000 [ 411.850893] [<bf05cbfc>] (transaction_kthread [btrfs]) from [<c014fab4>] (kthread+0x170/0x174) [ 411.850898] r10:ecbddbfc r9:bf05cbfc r8:ed669800 r7:e1c1c000 r6:00000000 r5:ed560d40 [ 411.850899] r4:ed54eb00 [ 411.850904] [<c014f944>] (kthread) from [<c01010e8>] (ret_from_fork+0x14/0x2c) [ 411.850907] Exception stack(0xe1c1dfb0 to 0xe1c1dff8) [ 411.850911] dfa0: 00000000 00000000 00000000 00000000 [ 411.850915] dfc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 [ 411.850919] dfe0: 00000000 00000000 00000000 00000000 00000013 00000000 [ 411.850923] r10:00000000 r9:00000000 r8:00000000 r7:00000000 r6:00000000 r5:c014f944 [ 411.850925] r4:ed560d40 [ 411.850976] ---[ end trace 1bbf4c394bd7fb26 ]--- [ 411.850986] BTRFS warning (device sda1): page_start and eb_start mismatch, eb_start=17594598817792 page_start=2412773376 [ 411.857914] BTRFS: error (device sda1) in btrfs_commit_transaction:2279: errno=-5 IO failure (Error while writing out transaction) [ 411.869764] BTRFS info (device sda1): forced readonly [ 411.869772] BTRFS warning (device sda1): Skipping commit of aborted transaction. [ 411.869778] BTRFS: error (device sda1) in cleanup_transaction:1832: errno=-5 IO failure [ 411.877843] BTRFS info (device sda1): delayed_refs has NO entry [ 411.878076] BTRFS warning (device sda1): Skipping commit of aborted transaction. [ 411.882940] BTRFS warning (device sda1): failed setting block group ro: -30 [ 411.882956] BTRFS info (device sda1): scrub: not finished on devid 3 with status: -30 [ 411.883030] BTRFS warning (device sda1): failed setting block group ro: -30 [ 411.883039] BTRFS info (device sda1): scrub: not finished on devid 4 with status: -30 [ 411.883104] BTRFS warning (device sda1): failed setting block group ro: -30 [ 411.883116] BTRFS info (device sda1): scrub: not finished on devid 1 with status: -30 [ 411.883191] BTRFS warning (device sda1): failed setting block group ro: -30 [ 411.883200] BTRFS info (device sda1): scrub: not finished on devid 2 with status: -30 [ 411.953186] BTRFS info (device sda1): delayed_refs has NO entry [ 576.361548] BTRFS critical (device sda1): unable to find logical 2412789760 length 4096 [ 576.369639] BTRFS critical (device sda1): unable to find logical 2412789760 length 4096 [ 576.378477] BTRFS critical (device sda1): unable to find logical 2412789760 length 4096 [ 576.386936] BTRFS critical (device sda1): unable to find logical 2412789760 length 4096 [ 576.395093] BTRFS critical (device sda1): unable to find logical 2412789760 length 4096 [ 576.396830] BTRFS error (device sda1): parent transid verify failed on 12756293910528 wanted 1752939 found 1752922 [ 576.403153] BTRFS critical (device sda1): unable to find logical 2412789760 length 4096 [ 576.450091] BTRFS error (device sda1): parent transid verify failed on 12756293910528 wanted 1752939 found 1752922 [ 576.462821] BTRFS error (device sda1): parent transid verify failed on 12756293910528 wanted 1752939 found 1752922 [ 576.474987] BTRFS error (device sda1): parent transid verify failed on 12756293910528 wanted 1752939 found 1752922 [ 576.486882] BTRFS error (device sda1): parent transid verify failed on 12756293910528 wanted 1752939 found 1752922 [ 576.497905] BTRFS error (device sda1): parent transid verify failed on 12756293910528 wanted 1752939 found 1752922 [ 576.824673] BTRFS error (device sda1): parent transid verify failed on 12756293910528 wanted 1752939 found 1752922 [ 576.838414] BTRFS error (device sda1): parent transid verify failed on 12756293910528 wanted 1752939 found 1752922 [ 577.324362] BTRFS error (device sda1): parent transid verify failed on 12756293910528 wanted 1752939 found 1752922 [ 577.337736] BTRFS error (device sda1): parent transid verify failed on 12756293910528 wanted 1752939 found 1752922 [ 578.462271] BTRFS critical (device sda1): unable to find logical 2412789760 length 4096 [ 578.470347] BTRFS critical (device sda1): unable to find logical 2412789760 length 4096 [ 578.479476] BTRFS critical (device sda1): unable to find logical 2412789760 length 4096 [ 578.488099] BTRFS critical (device sda1): unable to find logical 2412789760 length 4096 [ 578.496793] BTRFS critical (device sda1): unable to find logical 2412789760 length 4096 [ 578.505280] BTRFS critical (device sda1): unable to find logical 2412789760 length 4096 [ 580.539117] BTRFS critical (device sda1): unable to find logical 2412789760 length 4096 [ 580.547207] BTRFS critical (device sda1): unable to find logical 2412789760 length 4096 [ 580.556371] BTRFS critical (device sda1): unable to find logical 2412789760 length 4096 [ 580.564922] BTRFS critical (device sda1): unable to find logical 2412789760 length 4096 [ 580.573356] BTRFS critical (device sda1): unable to find logical 2412789760 length 4096 [ 580.581573] BTRFS critical (device sda1): unable to find logical 2412789760 length 4096 [ 581.427776] verify_parent_transid: 28 callbacks suppressed ... many more of these ``` On Sun, Dec 20, 2020 at 8:30 AM Qu Wenruo <quwenruo.bt...@gmx.com> wrote: > > > > On 2020/12/19 下午6:35, Joe Hermaszewski wrote: > > Ok, so I managed to get hold of a 64bit machine on which to run btrfs > > check. `btrfs check` returns exactly the same output as the armv7 box > > (no serious problems) which I suppose is good. `btrfs scrub` also > > finds no problems. Boring Logs below. > > > > What I don't quite understand is how the scrub problem on armv7l is so > > reliable when it's not persisted on the disks, is the same physical > > memory location being used for this breaking value, or is it perhaps a > > specific pattern of data on the bus which causes this? > > If it's so reliably reproducible, then I guess that would be the case, > either a specific memory range has the problem, or a specific pattern on > the bus causing the problem. > > > > > If it's the former, how easy would it be to find this broken location > > and blacklist it? If it's the latter then I guess there's no hope but > > to try replacing the psu/machine. > > > > The machine survived a couple of days of memtester (on about 95% of > > the RAM) and `7z b` with no problems, *shrug* > > The memtester should rule out the former case, the latter case may be > resolved by newer kernel if it's not a hardware problem but a software one. > > Thanks, > Qu > > > > > Best wishes and thanks for the generous help so far. > > Joe > > > > btrfs scrub aarch64: > > ``` > > [j@nixos:~]$ sudo btrfs scrub status -d /mnt > > UUID: b8f4ad49-29c8-4d19-a886-cef9c487f124 > > scrub device /dev/sda1 (id 1) history > > Scrub started: Fri Dec 18 14:24:30 2020 > > Status: finished > > Duration: 7:36:31 > > Total to scrub: 2.40TiB > > Rate: 91.95MiB/s > > Error summary: no errors found > > scrub device /dev/sdb1 (id 2) history > > Scrub started: Fri Dec 18 14:24:30 2020 > > Status: finished > > Duration: 7:12:51 > > Total to scrub: 2.40TiB > > Rate: 96.90MiB/s > > Error summary: no errors found > > scrub device /dev/sdd1 (id 3) history > > Scrub started: Fri Dec 18 14:24:30 2020 > > Status: finished > > Duration: 19:47:01 > > Total to scrub: 7.86TiB > > Rate: 115.70MiB/s > > Error summary: no errors found > > scrub device /dev/sdc1 (id 4) history > > Scrub started: Fri Dec 18 14:24:30 2020 > > Status: finished > > Duration: 19:46:38 > > Total to scrub: 7.86TiB > > Rate: 115.74MiB/s > > Error summary: no errors found > > ``` > > > > btrfs check aarch64: > > ``` > > [nixos@nixos:/]$ sudo btrfs check --readonly /dev/sda1 > > Opening filesystem to check... > > Checking filesystem on /dev/sda1 > > UUID: b8f4ad49-29c8-4d19-a886-cef9c487f124 > > [1/7] checking root items > > [2/7] checking extents > > [3/7] checking free space cache > > [4/7] checking fs roots > > root 294 inode 24665 errors 100, file extent discount > > Found file extent holes: > > start: 3709534208, len: 163840 > > root 294 inode 406548 errors 100, file extent discount > > Found file extent holes: > > start: 98729984, len: 720896 > > ERROR: errors found in fs roots > > found 11280701063168 bytes used, error(s) found > > total csum bytes: 10937464120 > > total tree bytes: 18538053632 > > total fs tree bytes: 5877579776 > > total extent tree bytes: 534052864 > > btree space waste bytes: 2316660292 > > file data blocks allocated: 17244587220992 > > referenced 14211684794368% > > ``` > > > > > > On Sat, Nov 28, 2020 at 8:46 AM Qu Wenruo <quwenruo.bt...@gmx.com> wrote: > >> > >> > >> > >> On 2020/11/27 下午11:15, Joe Hermaszewski wrote: > >>> Hi Qu, > >>> > >>> Thanks for the patch. I recompiled the kernel ran the scrub and your > >>> patch worked as expected, here is the log: > >>> > >>> ``` > >>> [ 337.365239] BTRFS info (device sda1): scrub: started on devid 2 > >>> [ 337.366283] BTRFS info (device sda1): scrub: started on devid 1 > >>> [ 337.402822] BTRFS info (device sda1): scrub: started on devid 3 > >>> [ 337.411944] BTRFS info (device sda1): scrub: started on devid 4 > >>> [ 471.997496] ------------[ cut here ]------------ > >>> [ 471.997614] WARNING: CPU: 0 PID: 218 at fs/btrfs/disk-io.c:531 > >>> btree_csum_one_bio+0x22c/0x278 [btrfs] > >>> [ 471.997616] Modules linked in: cfg80211 rfkill 8021q ip6table_nat > >>> iptable_nat nf_nat ftdi_sio phy_generic usbserial xt_conntrack > >>> nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6t_rpfilter ipt_rpfilter > >>> ip6table_raw uio_pdrv_genirq iptable_raw uio xt_pkttype nf_log_ipv6 > >>> nf_log_ipv4 nf_log_common xt_LOG xt_tcpudp ip6table_filter ip6_tables > >>> iptable_filter sch_fq_codel loop tun tap macvlan bridge stp llc lm75 > >>> ip_tables x_tables autofs4 dm_mod dax btrfs libcrc32c xor raid6_pq > >>> [ 471.997666] CPU: 0 PID: 218 Comm: btrfs-transacti Not tainted 5.4.78 > >>> #1-NixOS > >>> [ 471.997668] Hardware name: Marvell Armada 380/385 (Device Tree) > >>> [ 471.997669] Backtrace: > >>> [ 471.997680] [<c010f698>] (dump_backtrace) from [<c010f938>] > >>> (show_stack+0x20/0x24) > >>> [ 471.997684] r7:00000213 r6:600c0013 r5:00000000 r4:c0f8c044 > >>> [ 471.997691] [<c010f918>] (show_stack) from [<c0a1b848>] > >>> (dump_stack+0x98/0xac) > >>> [ 471.997696] [<c0a1b7b0>] (dump_stack) from [<c012a998>] > >>> (__warn+0xe0/0x108) > >>> [ 471.997700] r7:00000213 r6:bf058f4c r5:00000009 r4:bf120990 > >>> [ 471.997704] [<c012a8b8>] (__warn) from [<c012ad24>] > >>> (warn_slowpath_fmt+0x74/0xc4) > >>> [ 471.997707] r7:00000213 r6:bf120990 r5:00000000 r4:e1822000 > >>> [ 471.997765] [<c012acb4>] (warn_slowpath_fmt) from [<bf058f4c>] > >>> (btree_csum_one_bio+0x22c/0x278 [btrfs]) > >>> [ 471.997770] r9:00000001 r8:c10decb8 r7:e1822000 r6:ed66b000 > >>> r5:00001000 r4:4fd00000 > >>> [ 471.997872] [<bf058d20>] (btree_csum_one_bio [btrfs]) from > >>> [<bf059f84>] (btree_submit_bio_hook+0xe8/0x100 [btrfs]) > >>> [ 471.997877] r10:e2197b48 r9:ec845fc0 r8:ec845f70 r7:ed66b000 > >>> r6:00000000 r5:e2b74af0 > >>> [ 471.997879] r4:bf059e9c > >>> [ 471.997981] [<bf059e9c>] (btree_submit_bio_hook [btrfs]) from > >>> [<bf08b1ac>] (submit_one_bio+0x44/0x5c [btrfs]) > >>> [ 471.997985] r7:ef3724d4 r6:e1823cac r5:00000000 r4:bf059e9c > >>> [ 471.998087] [<bf08b168>] (submit_one_bio [btrfs]) from [<bf096664>] > >>> (btree_write_cache_pages+0x380/0x408 [btrfs]) > >>> [ 471.998090] r5:00000000 r4:00000000 > >>> [ 471.998193] [<bf0962e4>] (btree_write_cache_pages [btrfs]) from > >>> [<bf0590b8>] (btree_writepages+0x7c/0x84 [btrfs]) > >>> [ 471.998197] r10:00000001 r9:4fd00000 r8:c0280d14 r7:e1822000 > >>> r6:e1823d80 r5:ec845f70 > >>> [ 471.998199] r4:e1823d80 > >>> [ 471.998255] [<bf05903c>] (btree_writepages [btrfs]) from > >>> [<c02847c8>] (do_writepages+0x58/0xf4) > >>> [ 471.998258] r5:ec845f70 r4:ec845e68 > >>> [ 471.998266] [<c0284770>] (do_writepages) from [<c0278cb0>] > >>> (__filemap_fdatawrite_range+0xf8/0x130) > >>> [ 471.998270] r8:ec845f70 r7:00001000 r6:4fd0bfff r5:e1822000 > >>> r4:ec845e68 > >>> [ 471.998275] [<c0278bb8>] (__filemap_fdatawrite_range) from > >>> [<c0278e38>] (filemap_fdatawrite_range+0x2c/0x34) > >>> [ 471.998279] r10:ec845f70 r9:00001000 r8:4fd0bfff r7:e1823e4c > >>> r6:c1955e28 r5:00001000 > >>> [ 471.998281] r4:4fd0bfff > >>> [ 471.998335] [<c0278e0c>] (filemap_fdatawrite_range) from > >>> [<bf060544>] (btrfs_write_marked_extents+0x9c/0x1b0 [btrfs]) > >>> [ 471.998338] r5:00000001 r4:00000000 > >>> [ 471.998441] [<bf0604a8>] (btrfs_write_marked_extents [btrfs]) from > >>> [<bf0606f0>] (btrfs_write_and_wait_transaction+0x54/0xa4 [btrfs]) > >>> [ 471.998446] r10:e1822000 r9:ed66b010 r8:ed66b000 r7:c1955e28 > >>> r6:ed66b000 r5:e1822000 > >>> [ 471.998447] r4:eddc4f30 > >>> [ 471.998551] [<bf06069c>] (btrfs_write_and_wait_transaction [btrfs]) > >>> from [<bf062428>] (btrfs_commit_transaction+0x75c/0xc94 [btrfs]) > >>> [ 471.998555] r8:ed66b418 r7:c1955e00 r6:ed66b000 r5:eddc4f30 > >>> r4:00000000 > >>> [ 471.998658] [<bf061ccc>] (btrfs_commit_transaction [btrfs]) from > >>> [<bf05cd98>] (transaction_kthread+0x19c/0x1e0 [btrfs]) > >>> [ 471.998663] r10:ed66b28c r9:00000000 r8:001aab57 r7:00000064 > >>> r6:ed66b414 r5:00000bb8 > >>> [ 471.998664] r4:ed66b000 > >>> [ 471.998719] [<bf05cbfc>] (transaction_kthread [btrfs]) from > >>> [<c014facc>] (kthread+0x170/0x174) > >>> [ 471.998724] r10:ed765bfc r9:bf05cbfc r8:ed56a000 r7:e1822000 > >>> r6:00000000 r5:ed5c8600 > >>> [ 471.998726] r4:ed5c8740 > >>> [ 471.998731] [<c014f95c>] (kthread) from [<c01010e8>] > >>> (ret_from_fork+0x14/0x2c) > >>> [ 471.998733] Exception stack(0xe1823fb0 to 0xe1823ff8) > >>> [ 471.998737] 3fa0: 00000000 > >>> 00000000 00000000 00000000 > >>> [ 471.998741] 3fc0: 00000000 00000000 00000000 00000000 00000000 > >>> 00000000 00000000 00000000 > >>> [ 471.998745] 3fe0: 00000000 00000000 00000000 00000000 00000013 00000000 > >>> [ 471.998749] r10:00000000 r9:00000000 r8:00000000 r7:00000000 > >>> r6:00000000 r5:c014f95c > >>> [ 471.998751] r4:ed5c8600 > >>> [ 471.998820] ---[ end trace 6a9fb9f547659a4d ]--- > >>> [ 471.998855] BTRFS warning (device sda1): page_start and eb_start > >>> mismatch, eb_start=17593525075968 page_start=1339031552 > >>> [ 472.015514] BTRFS: error (device sda1) in > >>> btrfs_commit_transaction:2279: errno=-5 IO failure (Error while > >>> writing out transaction) > >>> [ 472.027321] BTRFS info (device sda1): forced readonly > >>> [ 472.027326] BTRFS warning (device sda1): Skipping commit of aborted > >>> transaction. > >>> [ 472.027330] BTRFS: error (device sda1) in cleanup_transaction:1832: > >>> errno=-5 IO failure > >>> [ 472.035378] BTRFS info (device sda1): delayed_refs has NO entry > >>> [ 472.039102] BTRFS warning (device sda1): failed setting block group > >>> ro: -30 > >>> [ 472.039107] BTRFS info (device sda1): scrub: not finished on devid > >>> 3 with status: -30 > >>> [ 472.039114] BTRFS info (device sda1): scrub: not finished on devid > >>> 4 with status: -30 > >>> [ 472.039227] BTRFS warning (device sda1): failed setting block group > >>> ro: -30 > >>> [ 472.039230] BTRFS warning (device sda1): failed setting block group > >>> ro: -30 > >>> [ 472.039237] BTRFS info (device sda1): scrub: not finished on devid > >>> 2 with status: -30 > >>> [ 472.039243] BTRFS info (device sda1): scrub: not finished on devid > >>> 1 with status: -30 > >>> ``` > >>> > >>> To rob you of the suspense: > >>> > >>> printf "0x%016x\n" (17593525075968 - 1339031552) > >>> 0x0000100000000000 > >>> > >>> Very surprising considering that this machine has ECC ram! > >> > >> ARMv7 with ECC, and we still had such bitflip!? > >> > >> I guess everything is possible in 2020. > >> > >>> > >>> What would you recommend as a fix? > >> > >> It's a hardware problem, I really don't have any recommendation at all. > >> > >> I guess you may want to request an RMA? > >> > >> Thanks, > >> Qu > >> > >>> > >>> Best, > >>> Joe > >>> > >>> On Thu, Nov 26, 2020 at 7:05 PM Qu Wenruo <quwenruo.bt...@gmx.com> wrote: > >>>> > >>>> > >>>> > >>>> On 2020/11/26 下午6:53, Joe Hermaszewski wrote: > >>>>> Hi Qu, > >>>>> > >>>>> This is the output of btrfs check with the device unmounted: > >>>>> > >>>>> ``` > >>>>> [root@nixos:~]# btrfs check --readonly /dev/sda1 > >>>>> Opening filesystem to check... > >>>>> Checking filesystem on /dev/sda1 > >>>>> UUID: b8f4ad49-29c8-4d19-a886-cef9c487f124 > >>>>> [1/7] checking root items > >>>>> [2/7] checking extents > >>>>> [3/7] checking free space cache > >>>>> [4/7] checking fs roots > >>>>> root 294 inode 24665 errors 100, file extent discount > >>>>> Found file extent holes: > >>>>> start: 3709534208, len: 163840 > >>>>> root 294 inode 406548 errors 100, file extent discount > >>>>> Found file extent holes: > >>>>> start: 0, len: 99450880 > >>>>> ERROR: errors found in fs roots > >>>>> found 11279251902464 bytes used, error(s) found > >>>>> total csum bytes: 10935906504 > >>>>> total tree bytes: 18455740416 > >>>>> total fs tree bytes: 5798936576 > >>>>> total extent tree bytes: 532250624 > >>>>> btree space waste bytes: 2308849707 > >>>>> file data blocks allocated: 17243418611712 > >>>>> referenced 14210256932864 > >>>>> ``` > >>>>> > >>>>> Looks the same as before... > >>>> > >>>> Then it means, at least your metadata is pretty safe. Nothing really > >>>> needs to be worried (so far). > >>>> > >>>> This also means, the problem only happens at runtime, not something > >>>> on-disk. > >>>> > >>>> If you are OK to re-compile the kernel, and the warning message is > >>>> always the same at line number 531 of disk-io.c, then I recommend the > >>>> attached debug diff > >>>> > >>>> The diff will output the found_start (from eb) and page_start (from > >>>> page). > >>>> If it's memory bitflip, we should have a pretty signature diff at the > >>>> output. > >>>> If not memory bitflip, then we really have a bug to dig. > >>>> > >>>> > >>>>> > >>>>> I'm afraid I don't have the precise details of the transid error > >>>>> saved. I think I meant to write earlier that the message that comes up > >>>>> now and then is the "errno=-5 IO failure (Error while writing out > >>>>> transaction)". I've not seen the transid mismatch one since the first > >>>>> incident, sorry for that slip-up. > >>>>> > >>>>> I'll try and attach to a known good system, will I have to attach the > >>>>> whole array because I don't think I have that many enclosures for the > >>>>> drives. > >>>> > >>>> You only need to mount the fs you're seeing the problem. > >>>> But if that fs is on multiple devices, then I guess you need to mount > >>>> them all. > >>>> > >>>> Thanks, > >>>> Qu > >>>>> > >>>>> I'll also run a memtest on this box. > >>>>> > >>>>> Thank you for the helpful response. > >>>>> > >>>>> Best, > >>>>> Joe > >>>>> > >>>>> On Thu, Nov 26, 2020 at 2:15 PM Qu Wenruo <quwenruo.bt...@gmx.com> > >>>>> wrote: > >>>>>> > >>>>>> > >>>>>> > >>>>>> On 2020/11/25 下午11:28, Joe Hermaszewski wrote: > >>>>>>> Hi, > >>>>>>> > >>>>>>> I have a arm32 machine with four drives with a btrfs fs spanning then > >>>>>>> in RAID1. > >>>>>>> The filesystem has started behaving badly recently and I'm writing to: > >>>>>>> > >>>>>>> - Solicit advice on how best to get the system back to a stable state > >>>>>>> - Report a potential bug > >>>>>>> > >>>>>>> ## What happened: > >>>>>>> > >>>>>>> A couple of days ago I could no longer ssh into it, and on the serial > >>>>>>> connection there were heaps of messages (and new ones appearing with > >>>>>>> great > >>>>>>> frequency) along the lines of: `parent transid verify failed on > >>>>>>> blah... wanted > >>>>>>> x got y`. > >>>>>>> > >>>>>>> Although I don't have a record of the precise messages I do remember > >>>>>>> that there > >>>>>>> was a difference of `15` between x and y. > >>>>>> > >>>>>> This normally can be a sign of unreliable HDD, which lies on FLUSH, > >>>>>> killing metadata COW. > >>>>>> > >>>>>> But, your btrfs check doesn't report the same problem, thus I'm > >>>>>> confused. > >>>>>> > >>>>>> Would you please try to run a "btrfs check --readonly /dev/sda1" with > >>>>>> the fs unmounted? > >>>>>> > >>>>>> And, would you provide the full dmesg of that mismatch? > >>>>>> The reason for the exact number is, I'm suspecting hardware memory > >>>>>> corruption. > >>>>>> > >>>>>>> > >>>>>>> I power-cycled system and started a scrub after it rebooted, this was > >>>>>>> interrupted quite promptly by several more errors in btrfs, and the > >>>>>>> disk > >>>>>>> remounted RO. > >>>>>>> > >>>>>>> Every now and then in the kernel log I get messages like: > >>>>>>> > >>>>>>> `parent transid verify failed on blah... wanted x got y` > >>>>>> > >>>>>> Not showing up in the gist dmesg though. > >>>>>> > >>>>>>> > >>>>>>> ## Important info > >>>>>>> > >>>>>>> The dev stats are all zero. > >>>>>>> > >>>>>>> Here are the outputs of some btrfs commands, dmesg and the kernel log > >>>>>>> from the > >>>>>>> previous two boots: > >>>>>>> https://gist.github.com/b1beab134403c5047e2efbceb98985f9 > >>>>>>> > >>>>>>> The "cut here" portion of the kernel log is as follows > >>>>>>> > >>>>>>> ``` > >>>>>>> [ 409.158097] ------------[ cut here ]------------ > >>>>>>> [ 409.158205] WARNING: CPU: 1 PID: 217 at fs/btrfs/disk-io.c:531 > >>>>>>> btree_csum_one_bio+0x208/0x248 [btrfs] > >>>>>> > >>>>>> The line number shows, the tree bytenr doesn't match with the one in > >>>>>> memory. > >>>>>> This is really too rare to be seen, especially when we have no other > >>>>>> error reported from btrfs (at least in the gist) > >>>>>> > >>>>>> Since there is no other problems showing up in the gist, it means it > >>>>>> could be a bit flip, considering the magic generation gap you mention > >>>>>> is > >>>>>> 15, I'm more suspicious about memory bit flip. > >>>>>> > >>>>>> If you can provide the full parent transid mismatch error message, it > >>>>>> may help to determine the possibility of hardware memory corruption. > >>>>>> > >>>>>> Thanks, > >>>>>> Qu > >>>>>> > >>>>>>> [ 409.158208] Modules linked in: cfg80211 rfkill 8021q ip6table_nat > >>>>>>> iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 > >>>>>>> nf_defrag_ipv4 ip6t_rpfilter ipt_rpfilter ip6table_raw iptable_raw > >>>>>>> xt_pkttype nf_log_ipv6 nf_log_ipv4 nf_log_common xt_LOG xt_tcpudp > >>>>>>> ftdi_sio usbserial phy_generic uio_pdrv_genirq uio ip6table_filter > >>>>>>> ip6_tables iptable_filter sch_fq_codel loop tun tap macvlan bridge stp > >>>>>>> llc lm75 ip_tables x_tables autofs4 dm_mod dax btrfs libcrc32c xor > >>>>>>> raid6_pq > >>>>>>> [ 409.158258] CPU: 1 PID: 217 Comm: btrfs-transacti Not tainted > >>>>>>> 5.4.77 #1-NixOS > >>>>>>> [ 409.158260] Hardware name: Marvell Armada 380/385 (Device Tree) > >>>>>>> [ 409.158261] Backtrace: > >>>>>>> [ 409.158272] [<c010f698>] (dump_backtrace) from [<c010f938>] > >>>>>>> (show_stack+0x20/0x24) > >>>>>>> [ 409.158277] r7:00000213 r6:600f0013 r5:00000000 r4:c0f8c044 > >>>>>>> [ 409.158283] [<c010f918>] (show_stack) from [<c0a1b388>] > >>>>>>> (dump_stack+0x98/0xac) > >>>>>>> [ 409.158288] [<c0a1b2f0>] (dump_stack) from [<c012a998>] > >>>>>>> (__warn+0xe0/0x108) > >>>>>>> [ 409.158292] r7:00000213 r6:bf058ec8 r5:00000009 r4:bf120990 > >>>>>>> [ 409.158296] [<c012a8b8>] (__warn) from [<c012ad24>] > >>>>>>> (warn_slowpath_fmt+0x74/0xc4) > >>>>>>> [ 409.158300] r7:00000213 r6:bf120990 r5:00000000 r4:e2392000 > >>>>>>> [ 409.158358] [<c012acb4>] (warn_slowpath_fmt) from [<bf058ec8>] > >>>>>>> (btree_csum_one_bio+0x208/0x248 [btrfs]) > >>>>>>> [ 409.158363] r9:e277abe0 r8:00000001 r7:e2392000 r6:ea3d17f0 > >>>>>>> r5:00000000 r4:eefd2d3c > >>>>>>> [ 409.158465] [<bf058cc0>] (btree_csum_one_bio [btrfs]) from > >>>>>>> [<bf059ef4>] (btree_submit_bio_hook+0xe8/0x100 [btrfs]) > >>>>>>> [ 409.158470] r10:e32ce170 r9:ecc45fc0 r8:ecc45f70 r7:ec82b000 > >>>>>>> r6:00000000 r5:ea3d17f0 > >>>>>>> [ 409.158472] r4:bf059e0c > >>>>>>> [ 409.158575] [<bf059e0c>] (btree_submit_bio_hook [btrfs]) from > >>>>>>> [<bf08b11c>] (submit_one_bio+0x44/0x5c [btrfs]) > >>>>>>> [ 409.158578] r7:ef36c048 r6:e2393cac r5:00000000 r4:bf059e0c > >>>>>>> [ 409.158683] [<bf08b0d8>] (submit_one_bio [btrfs]) from [<bf0965d4>] > >>>>>>> (btree_write_cache_pages+0x380/0x408 [btrfs]) > >>>>>>> [ 409.158686] r5:00000000 r4:00000000 > >>>>>>> [ 409.158788] [<bf096254>] (btree_write_cache_pages [btrfs]) from > >>>>>>> [<bf059028>] (btree_writepages+0x7c/0x84 [btrfs]) > >>>>>>> [ 409.158793] r10:00000001 r9:4fd00000 r8:c0280c94 r7:e2392000 > >>>>>>> r6:e2393d80 r5:ecc45f70 > >>>>>>> [ 409.158794] r4:e2393d80 > >>>>>>> [ 409.158850] [<bf058fac>] (btree_writepages [btrfs]) from > >>>>>>> [<c0284748>] (do_writepages+0x58/0xf4) > >>>>>>> [ 409.158852] r5:ecc45f70 r4:ecc45e68 > >>>>>>> [ 409.158860] [<c02846f0>] (do_writepages) from [<c0278c30>] > >>>>>>> (__filemap_fdatawrite_range+0xf8/0x130) > >>>>>>> [ 409.158864] r8:ecc45f70 r7:00001000 r6:4fd0bfff r5:e2392000 > >>>>>>> r4:ecc45e68 > >>>>>>> [ 409.158869] [<c0278b38>] (__filemap_fdatawrite_range) from > >>>>>>> [<c0278db8>] (filemap_fdatawrite_range+0x2c/0x34) > >>>>>>> [ 409.158874] r10:ecc45f70 r9:00001000 r8:4fd0bfff r7:e2393e4c > >>>>>>> r6:c73bc628 r5:00001000 > >>>>>>> [ 409.158875] r4:4fd0bfff > >>>>>>> [ 409.158929] [<c0278d8c>] (filemap_fdatawrite_range) from > >>>>>>> [<bf0604b4>] (btrfs_write_marked_extents+0x9c/0x1b0 [btrfs]) > >>>>>>> [ 409.158931] r5:00000001 r4:00000000 > >>>>>>> [ 409.159033] [<bf060418>] (btrfs_write_marked_extents [btrfs]) from > >>>>>>> [<bf060660>] (btrfs_write_and_wait_transaction+0x54/0xa4 [btrfs]) > >>>>>>> [ 409.159038] r10:e2392000 r9:ec82b010 r8:ec82b000 r7:c73bc628 > >>>>>>> r6:ec82b000 r5:e2392000 > >>>>>>> [ 409.159040] r4:c8b81ca8 > >>>>>>> [ 409.159141] [<bf06060c>] (btrfs_write_and_wait_transaction [btrfs]) > >>>>>>> from [<bf062398>] (btrfs_commit_transaction+0x75c/0xc94 [btrfs]) > >>>>>>> [ 409.159145] r8:ec82b418 r7:c73bc600 r6:ec82b000 r5:c8b81ca8 > >>>>>>> r4:00000000 > >>>>>>> [ 409.159248] [<bf061c3c>] (btrfs_commit_transaction [btrfs]) from > >>>>>>> [<bf05cd08>] (transaction_kthread+0x19c/0x1e0 [btrfs]) > >>>>>>> [ 409.159253] r10:ec82b28c r9:00000000 r8:001aaafa r7:00000064 > >>>>>>> r6:ec82b414 r5:00000bb8 > >>>>>>> [ 409.159254] r4:ec82b000 > >>>>>>> [ 409.159309] [<bf05cb6c>] (transaction_kthread [btrfs]) from > >>>>>>> [<c014fabc>] (kthread+0x170/0x174) > >>>>>>> [ 409.159313] r10:eca87bfc r9:bf05cb6c r8:ed619000 r7:e2392000 > >>>>>>> r6:00000000 r5:ed5ee700 > >>>>>>> [ 409.159315] r4:ed5ee1c0 > >>>>>>> [ 409.159320] [<c014f94c>] (kthread) from [<c01010e8>] > >>>>>>> (ret_from_fork+0x14/0x2c) > >>>>>>> [ 409.159322] Exception stack(0xe2393fb0 to 0xe2393ff8) > >>>>>>> [ 409.159326] 3fa0: 00000000 > >>>>>>> 00000000 00000000 00000000 > >>>>>>> [ 409.159331] 3fc0: 00000000 00000000 00000000 00000000 00000000 > >>>>>>> 00000000 00000000 00000000 > >>>>>>> [ 409.159334] 3fe0: 00000000 00000000 00000000 00000000 00000013 > >>>>>>> 00000000 > >>>>>>> [ 409.159338] r10:00000000 r9:00000000 r8:00000000 r7:00000000 > >>>>>>> r6:00000000 r5:c014f94c > >>>>>>> [ 409.159340] r4:ed5ee700 > >>>>>>> [ 409.159342] ---[ end trace eea59ced12fa7859 ]--- > >>>>>>> [ 409.165084] BTRFS: error (device sda1) in > >>>>>>> btrfs_commit_transaction:2279: errno=-5 IO failure (Error while > >>>>>>> writing out transaction) > >>>>>>> [ 409.176920] BTRFS info (device sda1): forced readonly > >>>>>>> [ 409.176947] BTRFS warning (device sda1): Skipping commit of aborted > >>>>>>> transaction. > >>>>>>> [ 409.176952] BTRFS: error (device sda1) in cleanup_transaction:1832: > >>>>>>> errno=-5 IO failure > >>>>>>> [ 409.185049] BTRFS info (device sda1): delayed_refs has NO entry > >>>>>>> [ 409.310199] BTRFS info (device sda1): scrub: not finished on devid > >>>>>>> 3 with status: -125 > >>>>>>> [ 409.664880] BTRFS info (device sda1): scrub: not finished on devid > >>>>>>> 4 with status: -125 > >>>>>>> [ 410.106791] BTRFS info (device sda1): scrub: not finished on devid > >>>>>>> 1 with status: -125 > >>>>>>> [ 411.268585] BTRFS warning (device sda1): failed setting block > >>>>>>> group ro: -30 > >>>>>>> [ 411.268594] BTRFS info (device sda1): scrub: not finished on devid > >>>>>>> 2 with status: -30 > >>>>>>> [ 411.268605] BTRFS info (device sda1): delayed_refs has NO entry > >>>>>>> ``` > >>>>>>> > >>>>>>> Information requested here > >>>>>>> (https://btrfs.wiki.kernel.org/index.php/Btrfs_mailing_list): > >>>>>>> > >>>>>>> ``` > >>>>>>> $ uname -a > >>>>>>> Linux thanos 5.4.77 #1-NixOS SMP Tue Nov 10 20:13:20 UTC 2020 armv7l > >>>>>>> GNU/Linux > >>>>>>> > >>>>>>> $ btrfs --version > >>>>>>> btrfs-progs v5.7 > >>>>>>> > >>>>>>> $ sudo btrfs fi show > >>>>>>> Label: none uuid: b8f4ad49-29c8-4d19-a886-cef9c487f124 > >>>>>>> Total devices 4 FS bytes used 10.26TiB > >>>>>>> devid 1 size 3.64TiB used 2.40TiB path /dev/sda1 > >>>>>>> devid 2 size 3.64TiB used 2.40TiB path /dev/sdc1 > >>>>>>> devid 3 size 9.09TiB used 7.86TiB path /dev/sdd1 > >>>>>>> devid 4 size 9.09TiB used 7.86TiB path /dev/sdb1 > >>>>>>> > >>>>>>> Label: none uuid: d02a3067-0a23-4c1f-96ac-80dbc26622f2 > >>>>>>> Total devices 1 FS bytes used 116.35MiB > >>>>>>> devid 1 size 399.82MiB used 224.00MiB path /dev/sda2 > >>>>>>> > >>>>>>> $ sudo btrfs fi df / > >>>>>>> Data, RAID1: total=10.25TiB, used=10.24TiB > >>>>>>> System, RAID1: total=64.00MiB, used=1.45MiB > >>>>>>> Metadata, RAID1: total=18.00GiB, used=17.19GiB > >>>>>>> GlobalReserve, single: total=512.00MiB, used=0.00B > >>>>>>> ``` > >>>>>>> > >>>>>>> Thanks to demfloro and multicore on #btrfs for prompting this email. > >>>>>>> > >>>>>>> Best wishes, > >>>>>>> Joe > >>>>>>> > >>>>>> > >>