-------- Original Message  --------
Subject: Re: Uncorrectable errors on RAID6
From: Tobias Holst <to...@tobby.eu>
To: Qu Wenruo <quwen...@cn.fujitsu.com>
Date: 2015-05-29 10:00

Thanks, Qu, sad news... :-(
No, I didn't defrag with older kernels either. Maybe I did it a while
ago with 3.19.x, but there was a scrub afterwards and it showed no
errors, so this shouldn't be the problem. The steps described above
were all done with 4.0.3/4.0.4.

Balances and scrubs all stop at ~1.5 TiB of ~13.3 TiB. The balance
fails with an error in the log; the scrub just stops doing anything
according to dstat, reports no error, and still shows "running".

The errors/problems started during the first balance, but maybe the
balance only revealed them and is not the cause.

Here are detailed debug infos to (maybe?) recreate the problem. This is
exactly what happened here over some time. As I can only tell when it
definitively was clean (scrub at the beginning of May) and when it
definitively was broken (now, end of May), there may be some more
steps necessary to reproduce, because several things happened in the
meantime:
- filesystem was created with "mkfs.btrfs -f -m raid6 -d raid6 -L
t-raid -O extref,raid56,skinny-metadata,no-holes" with 6
LUKS-encrypted HDDs on kernel 3.19
LUKS...
Even though LUKS is much more stable than btrfs and may not be related
to the bug, your setup is quite complex anyway.
- mounted with options "defaults,compress-force=zlib,space_cache,autodefrag"

Normally I'd not recommend compress-force, as btrfs can auto-detect the
compression ratio. But such a complex setup, with these mount options on
a LUKS base, should be quite a good playground to produce some bugs.
- copy all data onto it
- all data on the devices is now compressed with zlib
-> until now the filesystem is ok, scrub shows no errors
autodefrag seems not related to this bug, as you removed it from the
mount options later. It doesn't even have an effect here anyway, since
you copied the data in from another place, without overwriting.

- now mount it with "defaults,compress-force=lzo,space_cache" instead
- use kernel 4.0.3/4.0.4
- create a r/o-snapshot
An RO snapshot... I remember there was an RO snapshot bug, but it seems to be fixed in 4.x?
- defrag some data with "-clzo"
- have some (not much) I/O during the process
- this should approximately double the size of the defragged data,
because your snapshot contains your data compressed with zlib and your
volume contains your data compressed with lzo
- delete the snapshot
- wait some time until the cleaning is complete, still some other I/O
during this
- this doesn't free as much space as the snapshot contained (?)
-> is this OK? Maybe the problem already existed/started here
- defrag the rest of all data on the devices with "-clzo", still some
other I/O during this
- now start a balance of the whole array
-> errors will spam the log and it's broken.

I hope it is possible to reproduce the errors and find out exactly
when this happens. I'll do the same steps again, too, but maybe there
is someone else who could try it as well?
I'll try it with a script, but maybe without LUKS, to simplify the setup.
With some small loop devices just for testing, this shouldn't take too
long, even if it sounds like it ;-)
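
Something like this sketch, following the steps above but skipping LUKS
(all the file names, sizes, the data source, and the mount point below
are made up for illustration; a real run needs far more data):

#!/bin/bash
# back the raid6 with 6 small loop devices
for i in 1 2 3 4 5 6; do
    truncate -s 4G /tmp/raid6-disk$i.img
done
DEVS=$(for i in 1 2 3 4 5 6; do losetup -f --show /tmp/raid6-disk$i.img; done)
FIRST=$(echo $DEVS | cut -d' ' -f1)

# same mkfs and initial mount options as above, just without LUKS
mkfs.btrfs -f -m raid6 -d raid6 -L t-raid \
    -O extref,raid56,skinny-metadata,no-holes $DEVS
btrfs device scan
mkdir -p /mnt/t-raid
mount -o compress-force=zlib,space_cache,autodefrag $FIRST /mnt/t-raid

# fill it with data; everything ends up zlib-compressed
cp -a /usr/share /mnt/t-raid/data
sync

# remount with lzo instead of zlib, autodefrag dropped
umount /mnt/t-raid
mount -o compress-force=lzo,space_cache $FIRST /mnt/t-raid

# take an ro snapshot, then defrag to lzo with some I/O in parallel
btrfs subvolume snapshot -r /mnt/t-raid /mnt/t-raid/snap
dd if=/dev/zero of=/mnt/t-raid/noise bs=1M count=256 &
btrfs filesystem defragment -r -clzo /mnt/t-raid/data
wait

# delete the snapshot, give the cleaner some time, then balance
btrfs subvolume delete /mnt/t-raid/snap
sleep 120
btrfs balance start /mnt/t-raid
dmesg | tail -n 50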

Back to my actual data: Are there any tips on how to recover?
For recovery, first just try "cp -r <mnt>/*" to grab what's still
completely OK. Maybe the recovery mount option can help in the process?
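
(A sketch of that first step; the mapped device name and the target
directory are just examples:)

mount -o ro,recovery /dev/mapper/luks-hdd1 /mnt/t-raid
cp -r /mnt/t-raid/* /backup/salvage/ 2> salvage-errors.log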

Then you may try "btrfs restore", which is the safest method: it won't
write a single byte to the offline disks.
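
(For example, run against the unmounted fs; the device and target
directory are again made up:)

btrfs restore -v /dev/mapper/luks-hdd1 /backup/restore/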

Lastly, you can try btrfsck --repair, *WITH A BINARY BACKUP OF YOUR DISKS*.
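
(For example, imaging each member device while the fs is offline; the
device names and the backup location are made up:)

for dev in /dev/mapper/luks-hdd1 /dev/mapper/luks-hdd2; do  # ...and so on for all 6
    dd if=$dev of=/backup/$(basename $dev).img bs=1M conv=noerror,sync
done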

With luck, it can make your filesystem completely clean, at the cost of
some lost files (maybe a file name lost, part of the data lost, or
nothing remaining).
Some corrupted files can be partly recovered into the 'lost+found' dir
of each subvolume.
In the best case, the recovered fs can pass btrfsck without any error.

But in your case the salvaged data may be somewhat meaningless, as it
works best on uncompressed data!

And in the worst case, your filesystem will be corrupted even further.
So think twice before using btrfsck --repair.

BTW, if you decide to use btrfsck --repair, please upload the full
output, since we can use it to improve the b-tree recovery code.
(Yeah, welcome to being a laboratory mouse for the real-world b-tree
recovery code.)
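
(E.g. capturing everything into one log to attach; the device name is
an example:)

btrfsck --repair /dev/mapper/luks-hdd1 2>&1 | tee btrfsck-repair.log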

Thanks,
Qu
Mount with "recovery", copy everything over and check the log to see
which files seem to be broken? Or are there some (dangerous) tricks to
repair this broken file system?
I do have a full backup, but it's very slow and may take weeks
(months?) if I have to recover everything.

Regards,
Tobias



2015-05-29 2:36 GMT+02:00 Qu Wenruo <quwen...@cn.fujitsu.com>:


-------- Original Message  --------
Subject: Re: Uncorrectable errors on RAID6
From: Tobias Holst <to...@tobby.eu>
To: Qu Wenruo <quwen...@cn.fujitsu.com>
Date: 2015-05-28 21:13

Ah it's already done. You can find the error-log over here:
https://paste.ee/p/sxCKF

In short there are several of these:
bytenr mismatch, want=6318462353408, have=56676169344768
checksum verify failed on 8955306033152 found 14EED112 wanted 6F1EB890
checksum verify failed on 8955306033152 found 14EED112 wanted 6F1EB890
checksum verify failed on 8955306033152 found 5B5F717A wanted C44CA54E
checksum verify failed on 8955306033152 found CF62F201 wanted E3B7021A
checksum verify failed on 8955306033152 found CF62F201 wanted E3B7021A

and these:
ref mismatch on [13431504896 16384] extent item 1, found 0
Backref 13431504896 root 7 not referenced back 0x1202acc0
Incorrect global backref count on 13431504896 found 1 wanted 0
backpointer mismatch on [13431504896 16384]
owner ref check failed [13431504896 16384]

and these:
ref mismatch on [1951739412480 524288] extent item 0, found 1
Backref 1951739412480 root 5 owner 27852 offset 644349952 num_refs 0
not found in extent tree
Incorrect local backref count on 1951739412480 root 5 owner 27852
offset 644349952 found 1 wanted 0 back 0x1a92aa20
backpointer mismatch on [1951739412480 524288]

Any ideas? :)

The metadata is really corrupted...

I'd recommend salvaging your data as soon as possible.

As for the cause: since you didn't run replace, it should at least not
be the bug spotted by Zhao Lei.

BTW, did you run defrag on older kernels?
IIRC, old kernels had a bug with snapshot-aware defrag, so it was later
disabled in newer kernels.
Not sure if it's related.

Balance may be related, but I'm not familiar with balance on RAID5/6,
so it's hard to say.

Sorry for not being able to provide much help.

But if you have enough time to find a stable method to reproduce the
bug, it's best to try it on loop devices; that would definitely help us
to debug.

Thanks,
Qu


Regards
Tobias


2015-05-28 14:57 GMT+02:00 Tobias Holst <to...@tobby.eu>:

Hi Qu,

no, I didn't run a replace. But I ran a defrag with "-clzo" on all
files while there was slight I/O on the devices. I don't know if this
could cause corruption, too?

Later on I deleted an r/o-snapshot, which should have freed a big amount
of storage space. It didn't free as much as it should have, so after a
few days I started a balance to free the space. During the balance the
first checksum errors appeared and the whole balance process crashed:

[19174.342882] BTRFS: dm-5 checksum verify failed on 6318462353408
wanted 25D94CD6 found 8BA427D4 level 1
[19174.365473] BTRFS: dm-5 checksum verify failed on 6318462353408
wanted 25D94CD6 found 8BA427D4 level 1
[19174.365651] BTRFS: dm-5 checksum verify failed on 6318462353408
wanted 25D94CD6 found 8BA427D4 level 1
[19174.366168] BTRFS: dm-5 checksum verify failed on 6318462353408
wanted 25D94CD6 found 8BA427D4 level 1
[19174.366250] BTRFS: dm-5 checksum verify failed on 6318462353408
wanted 25D94CD6 found 8BA427D4 level 1
[19174.366392] BTRFS: dm-5 checksum verify failed on 6318462353408
wanted 25D94CD6 found 8BA427D4 level 1
[19174.367313] ------------[ cut here ]------------
[19174.367340] kernel BUG at
/home/kernel/COD/linux/fs/btrfs/relocation.c:242!
[19174.367384] invalid opcode: 0000 [#1] SMP
[19174.367418] Modules linked in: iosf_mbi kvm_intel kvm
crct10dif_pclmul ppdev dm_crypt crc32_pclmul ghash_clmulni_intel
aesni_intel aes_x86_64 lrw gf128mul glue_helper parport_pc ablk_helper
cryptd mac_hid 8250_fintek virtio_rng serio_raw i2c_piix4 pvpanic lp
parport btrfs xor raid6_pq cirrus syscopyarea sysfillrect sysimgblt
ttm mpt2sas drm_kms_helper raid_class scsi_transport_sas drm floppy
psmouse pata_acpi
[19174.367656] CPU: 1 PID: 4960 Comm: btrfs Not tainted
4.0.4-040004-generic #201505171336
[19174.367703] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS Bochs 01/01/2011
[19174.367752] task: ffff8804274e8000 ti: ffff880367b50000 task.ti:
ffff880367b50000
[19174.367797] RIP: 0010:[<ffffffffc05ec4ba>]  [<ffffffffc05ec4ba>]
backref_cache_cleanup+0xea/0x100 [btrfs]
[19174.367867] RSP: 0018:ffff880367b53bd8  EFLAGS: 00010202
[19174.367905] RAX: ffff88008250d8f8 RBX: ffff88008250d820 RCX:
0000000180200001
[19174.367948] RDX: ffff88008250d8d8 RSI: ffff88008250d8e8 RDI:
0000000040000000
[19174.367992] RBP: ffff880367b53bf8 R08: ffff880418b77780 R09:
0000000180200001
[19174.368037] R10: ffffffffc05ec1d9 R11: 0000000000018bf8 R12:
0000000000000001
[19174.368081] R13: ffff88008250d8e8 R14: 00000000fffffffb R15:
ffff880367b53c28
[19174.368125] FS:  00007f7fd6831c80(0000) GS:ffff88043fc40000(0000)
knlGS:0000000000000000
[19174.368172] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[19174.368210] CR2: 00007f65f7564770 CR3: 00000003ac92f000 CR4:
00000000001407e0
[19174.368257] Stack:
[19174.368279]  00000000fffffffb ffff88008250d800 ffff88042b3d46e0
ffff88006845f990
[19174.368327]  ffff880367b53c78 ffffffffc05f25eb ffff880367b53c78
0000000000000002
[19174.368376]  00ff880429e4c670 a9000010d8fb7e00 0000000000000000
0000000000000000
[19174.368424] Call Trace:
[19174.368459]  [<ffffffffc05f25eb>] relocate_block_group+0x2cb/0x510
[btrfs]
[19174.368509]  [<ffffffffc05f29e0>]
btrfs_relocate_block_group+0x1b0/0x2d0 [btrfs]
[19174.368562]  [<ffffffffc05c6eab>]
btrfs_relocate_chunk.isra.75+0x4b/0xd0 [btrfs]
[19174.368615]  [<ffffffffc05c82e8>] __btrfs_balance+0x348/0x460 [btrfs]
[19174.368663]  [<ffffffffc05c87b5>] btrfs_balance+0x3b5/0x5d0 [btrfs]
[19174.368710]  [<ffffffffc05d5cac>] btrfs_ioctl_balance+0x1cc/0x530
[btrfs]
[19174.368756]  [<ffffffff811b52e0>] ? handle_mm_fault+0xb0/0x160
[19174.368802]  [<ffffffffc05d7c7e>] btrfs_ioctl+0x69e/0xb20 [btrfs]
[19174.368845]  [<ffffffff8120f5b5>] do_vfs_ioctl+0x75/0x320
[19174.368882]  [<ffffffff8120f8f1>] SyS_ioctl+0x91/0xb0
[19174.368923]  [<ffffffff817f098d>] system_call_fastpath+0x16/0x1b
[19174.368962] Code: 3b 00 75 29 44 8b a3 00 01 00 00 45 85 e4 75 1b
44 8b 9b 04 01 00 00 45 85 db 75 0d 48 83 c4 08 5b 41 5c 41 5d 5d c3
0f 0b 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b 66 66 66 66 66 2e 0f 1f 84 00 00
00 00
[19174.369133] RIP  [<ffffffffc05ec4ba>]
backref_cache_cleanup+0xea/0x100 [btrfs]
[19174.369186]  RSP <ffff880367b53bd8>
[19174.369827] ------------[ cut here ]------------
[19174.369827] kernel BUG at
/home/kernel/COD/linux/arch/x86/mm/pageattr.c:216!
[19174.369827] invalid opcode: 0000 [#2] SMP
[19174.369827] Modules linked in: iosf_mbi kvm_intel kvm
crct10dif_pclmul ppdev dm_crypt crc32_pclmul ghash_clmulni_intel
aesni_intel aes_x86_64 lrw gf128mul glue_helper parport_pc ablk_helper
cryptd mac_hid 8250_fintek virtio_rng serio_raw i2c_piix4 pvpanic lp
parport btrfs xor raid6_pq cirrus syscopyarea sysfillrect sysimgblt
ttm mpt2sas drm_kms_helper raid_class scsi_transport_sas drm floppy
psmouse pata_acpi
[19174.369827] CPU: 1 PID: 4960 Comm: btrfs Not tainted
4.0.4-040004-generic #201505171336
[19174.369827] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS Bochs 01/01/2011
[19174.369827] task: ffff8804274e8000 ti: ffff880367b50000 task.ti:
ffff880367b50000
[19174.369827] RIP: 0010:[<ffffffff8106875f>]  [<ffffffff8106875f>]
cpa_flush_array+0x10f/0x120
[19174.369827] RSP: 0018:ffff880367b52cf8  EFLAGS: 00010046
[19174.369827] RAX: 0000000000000092 RBX: 0000000000000000 RCX:
0000000000000005
[19174.369827] RDX: 0000000000000001 RSI: 0000000000000200 RDI:
0000000000000000
[19174.369827] RBP: ffff880367b52d48 R08: ffff880411ef2000 R09:
0000000000000001
[19174.369827] R10: 0000000000000004 R11: ffffffff81adb6be R12:
0000000000000200
[19174.369827] R13: 0000000000000001 R14: 0000000000000005 R15:
0000000000000000
[19174.369827] FS:  00007f7fd6831c80(0000) GS:ffff88043fc40000(0000)
knlGS:0000000000000000
[19174.369827] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[19174.369827] CR2: 00007f65f7564770 CR3: 00000003ac92f000 CR4:
00000000001407e0
[19174.369827] Stack:
[19174.369827]  0000000000000001 ffff880411ef2000 0000000000000001
0000000000000001
[19174.369827]  ffff880367b52d48 0000000000000000 0000000000000200
0000000000000000
[19174.369827]  0000000000000004 0000000000000000 ffff880367b52de8
ffffffff8106979c
[19174.369827] Call Trace:
[19174.369827]  [<ffffffff8106979c>] change_page_attr_set_clr+0x23c/0x2c0
[19174.369827]  [<ffffffff810699b0>] _set_pages_array+0xf0/0x140
[19174.369827]  [<ffffffff81069a13>] set_pages_array_wc+0x13/0x20
[19174.369827]  [<ffffffffc052d926>] ttm_set_pages_caching+0x46/0x80
[ttm]
[19174.369827]  [<ffffffffc052da24>]
ttm_alloc_new_pages.isra.6+0xc4/0x1a0 [ttm]
[19174.369827]  [<ffffffffc052dc76>]
ttm_page_pool_fill_locked.isra.7.constprop.12+0x96/0x140 [ttm]
[19174.369827]  [<ffffffffc052dd5a>]
ttm_page_pool_get_pages.isra.8.constprop.10+0x3a/0xe0 [ttm]
[19174.369827]  [<ffffffffc052dea0>]
ttm_get_pages.constprop.11+0xa0/0x1f0 [ttm]
[19174.369827]  [<ffffffffc052e07c>] ttm_pool_populate+0x8c/0xf0 [ttm]
[19174.369827]  [<ffffffffc052a0f3>] ? ttm_mem_reg_ioremap+0x63/0xf0
[ttm]
[19174.369827]  [<ffffffffc056146e>] cirrus_ttm_tt_populate+0xe/0x10
[cirrus]
[19174.369827]  [<ffffffffc052a7ea>] ttm_bo_move_memcpy+0x5ea/0x650 [ttm]
[19174.369827]  [<ffffffffc05266ac>] ? ttm_tt_init+0x8c/0xb0 [ttm]
[19174.369827]  [<ffffffff811c3aee>] ? __vmalloc_node+0x3e/0x40
[19174.369827]  [<ffffffffc0561418>] cirrus_bo_move+0x18/0x20 [cirrus]
[19174.369827]  [<ffffffffc0527f5f>] ttm_bo_handle_move_mem+0x27f/0x6f0
[ttm]
[19174.369827]  [<ffffffffc0528f7c>] ttm_bo_move_buffer+0xdc/0xf0 [ttm]
[19174.369827]  [<ffffffffc0529023>] ttm_bo_validate+0x93/0xb0 [ttm]
[19174.369827]  [<ffffffffc0561c3f>] cirrus_bo_push_sysram+0x8f/0xe0
[cirrus]
[19174.369827]  [<ffffffffc055feb3>]
cirrus_crtc_do_set_base.isra.9.constprop.10+0x83/0x2b0 [cirrus]
[19174.369827]  [<ffffffff811df534>] ? kmem_cache_alloc_trace+0x1c4/0x210
[19174.369827]  [<ffffffffc056056f>] cirrus_crtc_mode_set+0x48f/0x4f0
[cirrus]
[19174.369827]  [<ffffffffc04c29de>]
drm_crtc_helper_set_mode+0x35e/0x5c0 [drm_kms_helper]
[19174.369827]  [<ffffffffc04c35f2>]
drm_crtc_helper_set_config+0x6d2/0xad0 [drm_kms_helper]
[19174.369827]  [<ffffffffc0560f9a>] ? cirrus_dirty_update+0xca/0x320
[cirrus]
[19174.369827]  [<ffffffff811df534>] ? kmem_cache_alloc_trace+0x1c4/0x210
[19174.369827]  [<ffffffffc0406026>]
drm_mode_set_config_internal+0x66/0x110 [drm]
[19174.369827]  [<ffffffffc04ceee2>]
drm_fb_helper_pan_display+0xa2/0xf0 [drm_kms_helper]
[19174.369827]  [<ffffffff814382cd>] fb_pan_display+0xbd/0x170
[19174.369827]  [<ffffffff81432629>] bit_update_start+0x29/0x60
[19174.369827]  [<ffffffff81431ee2>] fbcon_switch+0x3b2/0x560
[19174.369827]  [<ffffffff814c22f9>] redraw_screen+0x179/0x220
[19174.369827]  [<ffffffff8143024a>] fbcon_blank+0x21a/0x2d0
[19174.369827]  [<ffffffff810d0aa2>] ? wake_up_klogd+0x32/0x40
[19174.369827]  [<ffffffff810d0cd8>] ? console_unlock.part.19+0x228/0x2a0
[19174.369827]  [<ffffffff810e343c>] ? internal_add_timer+0x6c/0x90
[19174.369827]  [<ffffffff810e58d9>] ? mod_timer+0xf9/0x200
[19174.369827]  [<ffffffff814c2de0>] do_unblank_screen.part.22+0xa0/0x180
[19174.369827]  [<ffffffff814c2f0c>] do_unblank_screen+0x4c/0x80
[19174.369827]  [<ffffffffc05ec4ba>] ? backref_cache_cleanup+0xea/0x100
[btrfs]
[19174.369827]  [<ffffffff814c2f50>] unblank_screen+0x10/0x20
[19174.369827]  [<ffffffff813c3ccd>] bust_spinlocks+0x1d/0x40
[19174.369827]  [<ffffffff81019bd3>] oops_end+0x43/0x120
[19174.369827]  [<ffffffff8101a2f8>] die+0x58/0x90
[19174.369827]  [<ffffffff8101642d>] do_trap+0xcd/0x160
[19174.369827]  [<ffffffff810167e6>] do_error_trap+0xe6/0x170
[19174.369827]  [<ffffffffc05ec4ba>] ? backref_cache_cleanup+0xea/0x100
[btrfs]
[19174.369827]  [<ffffffff817dce0f>] ? __slab_free+0xee/0x234
[19174.369827]  [<ffffffff817dce0f>] ? __slab_free+0xee/0x234
[19174.369827]  [<ffffffffc05baf0e>] ? clear_state_bit+0xae/0x170 [btrfs]
[19174.369827]  [<ffffffffc05ba67a>] ? free_extent_state+0x6a/0xd0
[btrfs]
[19174.369827]  [<ffffffff810172e0>] do_invalid_op+0x20/0x30
[19174.369827]  [<ffffffff817f24ee>] invalid_op+0x1e/0x30
[19174.369827]  [<ffffffffc05ec1d9>] ?
free_backref_node.isra.36+0x19/0x20 [btrfs]
[19174.369827]  [<ffffffffc05ec4ba>] ? backref_cache_cleanup+0xea/0x100
[btrfs]
[19174.369827]  [<ffffffffc05ec43c>] ? backref_cache_cleanup+0x6c/0x100
[btrfs]
[19174.369827]  [<ffffffffc05f25eb>] relocate_block_group+0x2cb/0x510
[btrfs]
[19174.369827]  [<ffffffffc05f29e0>]
btrfs_relocate_block_group+0x1b0/0x2d0 [btrfs]
[19174.369827]  [<ffffffffc05c6eab>]
btrfs_relocate_chunk.isra.75+0x4b/0xd0 [btrfs]
[19174.369827]  [<ffffffffc05c82e8>] __btrfs_balance+0x348/0x460 [btrfs]
[19174.369827]  [<ffffffffc05c87b5>] btrfs_balance+0x3b5/0x5d0 [btrfs]
[19174.369827]  [<ffffffffc05d5cac>] btrfs_ioctl_balance+0x1cc/0x530
[btrfs]
[19174.369827]  [<ffffffff811b52e0>] ? handle_mm_fault+0xb0/0x160
[19174.369827]  [<ffffffffc05d7c7e>] btrfs_ioctl+0x69e/0xb20 [btrfs]
[19174.369827]  [<ffffffff8120f5b5>] do_vfs_ioctl+0x75/0x320
[19174.369827]  [<ffffffff8120f8f1>] SyS_ioctl+0x91/0xb0
[19174.369827]  [<ffffffff817f098d>] system_call_fastpath+0x16/0x1b
[19174.369827] Code: 4e 8b 2c 23 eb cd 66 0f 1f 44 00 00 48 83 c4 28
5b 41 5c 41 5d 41 5e 41 5f 5d c3 90 be 00 10 00 00 4c 89 ef e8 a3 ee
ff ff eb c7 <0f> 0b 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f
44 00
[19174.369827] RIP  [<ffffffff8106875f>] cpa_flush_array+0x10f/0x120
[19174.369827]  RSP <ffff880367b52cf8>
[19174.369827] ---[ end trace 60adc437bd944044 ]---

After a reboot and a remount it always tried to resume the balance and
then crashed again, so I had to be quick to do a "btrfs balance
cancel". Then I started the scrub and got these uncorrectable errors I
mentioned in the first mail.
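
(Note for anyone hitting the same resume race: mounting with the
skip_balance option should leave the interrupted balance paused, so
there is no hurry to cancel it; the device and mount point here are
just examples:)

mount -o skip_balance /dev/mapper/luks-hdd1 /mnt/t-raid
btrfs balance cancel /mnt/t-raid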

I just unmounted it and started a btrfsck. I will post the output when
it's done.
It's already showing me several of these:

checksum verify failed on 18523667709952 found C240FB11 wanted 1ED6A587
checksum verify failed on 18523667709952 found C240FB11 wanted 1ED6A587
checksum verify failed on 18523667709952 found 5EAB6BFE wanted BA48D648
checksum verify failed on 18523667709952 found 8E19F60E wanted E3A34D18
checksum verify failed on 18523667709952 found C240FB11 wanted 1ED6A587
bytenr mismatch, want=18523667709952, have=10838194617263884761


Thanks,
Tobias



2015-05-28 4:49 GMT+02:00 Qu Wenruo <quwen...@cn.fujitsu.com>:



-------- Original Message  --------
Subject: Uncorrectable errors on RAID6
From: Tobias Holst <to...@tobby.eu>
To: linux-btrfs@vger.kernel.org <linux-btrfs@vger.kernel.org>
Date: 2015-05-28 10:18

Hi

I am doing a scrub on my 6-drive btrfs RAID6. Last time it found zero
errors, but now I am getting this in my log:

[ 6610.888020] BTRFS: checksum error at logical 478232346624 on dev
/dev/dm-2, sector 231373760: metadata leaf (level 0) in tree 2
[ 6610.888025] BTRFS: checksum error at logical 478232346624 on dev
/dev/dm-2, sector 231373760: metadata leaf (level 0) in tree 2
[ 6610.888029] BTRFS: bdev /dev/dm-2 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
[ 6611.271334] BTRFS: unable to fixup (regular) error at logical
478232346624 on dev /dev/dm-2
[ 6611.831370] BTRFS: checksum error at logical 478232346624 on dev
/dev/dm-2, sector 231373760: metadata leaf (level 0) in tree 2
[ 6611.831373] BTRFS: checksum error at logical 478232346624 on dev
/dev/dm-2, sector 231373760: metadata leaf (level 0) in tree 2
[ 6611.831375] BTRFS: bdev /dev/dm-2 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
[ 6612.396402] BTRFS: unable to fixup (regular) error at logical
478232346624 on dev /dev/dm-2
[ 6904.027456] BTRFS: checksum error at logical 478232346624 on dev
/dev/dm-2, sector 231373760: metadata leaf (level 0) in tree 2
[ 6904.027460] BTRFS: checksum error at logical 478232346624 on dev
/dev/dm-2, sector 231373760: metadata leaf (level 0) in tree 2
[ 6904.027463] BTRFS: bdev /dev/dm-2 errs: wr 0, rd 0, flush 0, corrupt 3, gen 0

Looks like it is always the same sector.

"btrfs balance status" shows me:
scrub status for a34ce68b-bb9f-49f0-91fe-21a924ef11ae
        scrub started at Thu May 28 02:25:31 2015, running for 6759 seconds
        total bytes scrubbed: 448.87GiB with 14 errors
        error details: read=8 csum=6
        corrected errors: 3, uncorrectable errors: 11, unverified errors: 0

What does it mean, and why are these errors uncorrectable even on a
RAID6?
Can I find out which files are affected?
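
(For data extents, the logical address from such errors can be mapped
back to file paths roughly like this, assuming a mounted fs; the
address is taken from the log above. Note the errors above are metadata
leaves, which don't belong to any single file:)

btrfs inspect-internal logical-resolve 478232346624 /mnt/t-raid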


If it's OK for you to take the fs offline,
btrfsck is the best method to check what happened, although it may take
a long time.
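
(Roughly, with the fs unmounted; the device path is just an example:)

umount /mnt/t-raid
btrfsck /dev/mapper/luks-hdd1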

There is a known bug where replace can cause checksum errors, found by
Zhao Lei.
So did you run replace while there was still some other disk I/O
happening?

Thanks,
Qu



system: Ubuntu 14.04.2
kernel version: 4.0.4
btrfs-tools version: 4.0

Regards
Tobias



