On 2017-10-23 17:17, Qu Wenruo wrote:
>
>
> On 2017-10-23 16:39, Wolf wrote:
>> Hi,
>> I'm having a problem with corruption in one file on my disk array. This is
>> the third time it has happened (probably). The first time I didn't check the
>> offending file, so I'm not sure, but it's likely. Btrfs scrub finds the
>> corruption and, according to both dmesg and its output, fixes it.
>> However, the next run finds it too.
>>
>> However, according to SMART the disk appears to be healthy (see below).
>> Plus the corruption is limited to one file.
>>
>> Is this an issue somewhere inside btrfs, or is it a disk HW related problem?
>>
>> Thank you for your help :)
>>
>> W.
>>
>> smartctl -a /dev/sde
>> ====================
>>
>> SMART Attributes Data Structure revision number: 16
>> Vendor Specific SMART Attributes with Thresholds:
>> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>>   1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always   -           0
>>   2 Throughput_Performance  0x0005   131   131   054    Pre-fail  Offline  -           116
>>   3 Spin_Up_Time            0x0007   100   100   024    Pre-fail  Always   -           0
>>   4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always   -           8
>>   5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always   -           0
>>   7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always   -           0
>>   8 Seek_Time_Performance   0x0005   140   140   020    Pre-fail  Offline  -           15
>>   9 Power_On_Hours          0x0012   100   100   000    Old_age   Always   -           401
>>  10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always   -           0
>>  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always   -           8
>>  22 Unknown_Attribute       0x0023   100   100   025    Pre-fail  Always   -           100
>> 192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always   -           33
>> 193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always   -           33
>> 194 Temperature_Celsius     0x0002   147   147   000    Old_age   Always   -           44 (Min/Max 23/46)
>> 196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always   -           0
>> 197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always   -           0
>> 198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline  -           0
>> 199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always   -           0
>>
>> SMART Error Log Version: 1
>> No Errors Logged
>>
>> SMART Self-test log structure revision number 1
>> Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
>> # 1  Extended offline    Completed without error       00%       357         -
>> # 2  Short offline       Completed without error       00%       335         -
>>
>> uname -a
>> ========
>>
>> Linux ws 4.13.8-1-ARCH #1 SMP PREEMPT Wed Oct 18 11:49:44 CEST 2017 x86_64 GNU/Linux
>>
>> btrfs --version
>> ===============
>>
>> btrfs-progs v4.13
>>
>> btrfs fi show
>> =============
>>
>> Label: none  uuid: db7e86f5-649d-44ce-9514-53c7ee0fbe09
>>     Total devices 2 FS bytes used 9.91GiB
>>     devid 1 size 103.79GiB used 20.03GiB path /dev/mapper/storage1-root
>>     devid 2 size 103.79GiB used 20.03GiB path /dev/mapper/storage2-root
>>
>> Label: 'RAID'  uuid: 9a4be3ac-e942-4e6a-bb24-2c4009a42572
>>     Total devices 7 FS bytes used 6.48TiB
>>     devid 1 size 1.82TiB used 715.03GiB path /dev/mapper/data3
>>     devid 2 size 1.82TiB used 715.00GiB path /dev/mapper/data4
>>     devid 3 size 2.73TiB used 1.40TiB path /dev/mapper/data2
>>     devid 4 size 2.73TiB used 1.40TiB path /dev/mapper/data1
>>     devid 5 size 2.73TiB used 1.40TiB path /dev/mapper/data5
>>     devid 6 size 2.73TiB used 1.40TiB path /dev/mapper/data6
>>     devid 7 size 7.28TiB used 5.95TiB path /dev/mapper/data7
>>
>> btrfs fi df /raid
>> =================
>>
>> Data, RAID1: total=6.47TiB, used=6.47TiB
>> System, RAID1: total=64.00MiB, used=944.00KiB
>> Metadata, RAID1: total=9.00GiB, used=7.56GiB
>> GlobalReserve, single: total=512.00MiB, used=0.00B
>
> RAID1 for both data and meta.
> So if nothing went wrong, it should be fixed.
>
> And IIRC RAID1 repair is already tested and checked, so it should not
> have such a problem.
>
>>
>> dmesg
>> =====
>>
>> [    0.000000] microcode: microcode updated early to revision 0xba, date = 2017-04-09
>> [    0.000000] random: get_random_bytes called from start_kernel+0x42/0x4b7 with crng_init=0
>> [    0.000000] Linux version 4.13.8-1-ARCH (builduser@tobias) (gcc version 7.2.0 (GCC)) #1 SMP PREEMPT Wed Oct 18 11:49:44 CEST 2017
>
> Arch user here too.
>
>> [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-linux root=UUID=db7e86f5-649d-44ce-9514-53c7ee0fbe09 rw cryptdevice=UUID=eb4011d2-38cd-467d-b515-7acf3ef68f01:storage1:allow-discards cryptkey=rootfs:/boot/crypto_keyfile.bin cryptdevice2=UUID=dd0821ae-8fc4-41d2-aab8-f313e2f6d0e8:storage2:allow-discards cryptkey2=rootfs:/boot/crypto_keyfile2.bin root=/dev/mapper/storage1-root
> [snip]
>> [ 5499.268721] lxc_bridge: port 3(vethRUY5VX) entered blocking state
>> [ 5499.268729] lxc_bridge: port 3(vethRUY5VX) entered disabled state
>> [ 5499.268759] device vethRUY5VX entered promiscuous mode
>> [ 5499.268863] IPv6: ADDRCONF(NETDEV_UP): vethRUY5VX: link is not ready
>> [ 5499.296331] eth0: renamed from vethTO2H20
>> [ 5499.321534] IPv6: ADDRCONF(NETDEV_CHANGE): vethRUY5VX: link becomes ready
>> [ 5499.321603] lxc_bridge: port 3(vethRUY5VX) entered blocking state
>> [ 5499.321608] lxc_bridge: port 3(vethRUY5VX) entered forwarding state
>> [22671.887860] perf: interrupt took too long (2508 > 2500), lowering kernel.perf_event_max_sample_rate to 79500
>> [24159.239247] perf: interrupt took too long (3154 > 3135), lowering kernel.perf_event_max_sample_rate to 63300
>> [27240.680874] perf: interrupt took too long (3952 > 3942), lowering kernel.perf_event_max_sample_rate to 50400
>> [30658.875802] BTRFS warning (device dm-12): checksum error at logical 37889245122560 on dev /dev/mapper/data7, sector 2743145096, root 23674, inode 206751, offset 762638336, length 4096, links 1 (path: アニメ/!waiting_for_better_quality/Gate: Jieitai Kanochi nite, Kaku Tatakaeri/GATE Jieitai Kanochi nite, Kaku Tatakaeri 05v2.mp4)
>
> Well, it's from several seasons ago, and I think there are better BDrip raws now.
> (Yeah, I'm also an Otaku)
>
> Despite that, it's better to hide such personal info though.
>
> And did you try to scrub only the corrupted device rather than the whole fs?
> Btrfs default scrub will start threads to scrub all devices at the same
> time, so maybe some concurrency caused the false alert.
>
>
> Also, it could be possible to check/repair it by using btrfs-progs,
> although the feature is still out-of-tree.
>
> Could you please try the following branch and use "btrfs scrub start
> --offline /dev/mapper/data7" to check if it reports the corruption is
> fixable?
> https://github.com/gujx2017/btrfs-progs/tree/offline_scrub
>
> Offline scrub gives us quite a good reference on whether it's fixable,
> without the possible hassle in the kernel.
> So it's worth trying.
>
> (But hey, there are better BDrip raws already, so I don't think
> you're really interested in fixing the corruption)
>
> Thanks,
> Qu
>> [30658.875806] BTRFS error (device dm-12): bdev /dev/mapper/data7 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
>> [30658.900322] BTRFS error (device dm-12): fixed up error at logical 37889245122560 on dev /dev/mapper/data7
>> [30658.902461] BTRFS warning (device dm-12): checksum error at logical 37889245126656 on dev /dev/mapper/data7, sector 2743145104, root 23674, inode 206751, offset 762642432, length 4096, links 1 (path: アニメ/!waiting_for_better_quality/Gate: Jieitai Kanochi nite, Kaku Tatakaeri/GATE Jieitai Kanochi nite, Kaku Tatakaeri 05v2.mp4)
>> [30658.902471] BTRFS error (device dm-12): bdev /dev/mapper/data7 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
>> [30658.904943] BTRFS error (device dm-12): fixed up error at logical 37889245126656 on dev /dev/mapper/data7
>> [30658.905427] BTRFS warning (device dm-12): checksum error at logical 37889245130752 on dev /dev/mapper/data7, sector 2743145112, root 23674, inode 206751, offset 762646528, length 4096, links 1 (path: アニメ/!waiting_for_better_quality/Gate: Jieitai Kanochi nite, Kaku Tatakaeri/GATE Jieitai Kanochi nite, Kaku Tatakaeri 05v2.mp4)
>> [30658.905435] BTRFS error (device dm-12): bdev /dev/mapper/data7 errs: wr 0, rd 0, flush 0, corrupt 3, gen 0
>> [30658.908247] BTRFS error (device dm-12): fixed up error at logical 37889245130752 on dev /dev/mapper/data7
>> [30658.912217] BTRFS warning (device dm-12): checksum error at logical 37889245134848 on dev /dev/mapper/data7, sector 2743145120, root 23674, inode 206751, offset 762650624, length 4096, links 1 (path: アニメ/!waiting_for_better_quality/Gate: Jieitai Kanochi nite, Kaku Tatakaeri/GATE Jieitai Kanochi nite, Kaku Tatakaeri 05v2.mp4)
>> [30658.912226] BTRFS error (device dm-12): bdev /dev/mapper/data7 errs: wr 0, rd 0, flush 0, corrupt 4, gen 0
>> [30658.922809] BTRFS error (device dm-12): fixed up error at logical 37889245134848 on dev /dev/mapper/data7
>> [30658.924887] BTRFS warning (device dm-12): checksum error at logical 37889245138944 on dev /dev/mapper/data7, sector 2743145128, root 23674, inode 206751, offset 762654720, length 4096, links 1 (path: アニメ/!waiting_for_better_quality/Gate: Jieitai Kanochi nite, Kaku Tatakaeri/GATE Jieitai Kanochi nite, Kaku Tatakaeri 05v2.mp4)
>> [30658.924894] BTRFS error (device dm-12): bdev /dev/mapper/data7 errs: wr 0, rd 0, flush 0, corrupt 5, gen 0
>> [30658.926522] BTRFS error (device dm-12): fixed up error at logical 37889245138944 on dev /dev/mapper/data7
>> [33389.149353] perf: interrupt took too long (4945 > 4940), lowering kernel.perf_event_max_sample_rate to 40200
>>

BTW, what is the dmesg showing? Only the first scrub, or is the 2nd scrub
also included?

At least from the dmesg, it just says that 5 different blocks (20K) got
corrupted and fixed.

Without the dmesg for the 2nd scrub, I can't be sure whether the kernel is
reporting the corruption again.

Thanks,
Qu
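
For reference, the "5 blocks (20K)" figure can be checked mechanically. The following is only a minimal sketch, assuming the scrub messages keep the exact wording seen in the 4.13 dmesg quoted above and that each log entry sits on a single line; the function name `summarize` and the regular expressions are made up for illustration and are not part of btrfs-progs.

```python
import re

# Patterns modelled on the scrub messages quoted in this thread (kernel 4.13).
# Assumption: other kernel versions may word these messages differently.
CSUM_RE = re.compile(
    r"BTRFS warning \(device [^)]+\): checksum error at logical (\d+) "
    r"on dev (\S+?), .*?length (\d+)"
)
FIXED_RE = re.compile(
    r"BTRFS error \(device [^)]+\): fixed up error at logical (\d+) on dev (\S+)"
)


def summarize(dmesg_text):
    """Count distinct corrupted blocks and check which ones were fixed up."""
    # Map logical address -> block length, so repeated reports of the
    # same block are counted only once.
    reported = {}
    for m in CSUM_RE.finditer(dmesg_text):
        reported[int(m.group(1))] = int(m.group(3))

    # Logical addresses the kernel says it repaired.
    fixed = {int(m.group(1)) for m in FIXED_RE.finditer(dmesg_text)}

    return {
        "blocks": len(reported),
        "bytes": sum(reported.values()),
        "unfixed": sorted(set(reported) - fixed),
    }
```

Running `summarize()` on the saved dmesg text of each scrub separately would show whether the second run reports the same logical addresses again, which is the missing piece of information here.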