Hi
A few weeks ago I was a bit under the weather for a week, and so was
unable to give my home server any attention.  Unfortunately during
this time it started a scheduled scrub and decided there were billions
of errors.  When I noticed this, maybe not thinking entirely straight
(i.e. not trying to find out what was causing the errors), I instantly
stopped the scrub and rebooted the server.  The volume didn't mount,
but still feeling unwell I just went back to bed.
When I was feeling better a few days later I tried to mount it, and
noticed I'd got a lot of transid error messages.  I'd had a similar
problem before over a year ago, but couldn't quite remember how I'd
fixed it.  A quick bit of googling later I found a post that suggested
using check --init-extent-tree, which sounded familiar, so I gave that
a go.  I realise this was a bit silly, and I should have looked a bit
further, but never mind.  Anyway, it literally ran for over a week (a
38TB volume) and eventually stopped, saying it had run out of space.
Feeling fully functional by this stage, I did a bit more digging, and
noticed a few of my drives kept on going intermittently off-line and
tracked the culprit to a bad SAS fan-out cable.  Replaced the cable
and did a check (no repair), and saw that there were still loads of
errors, but the drives were now staying on-line.  I then tried a check
with --repair, which after listing a few transid errors, bombed out
saying it was unable to repair root items, as the operation was not
permitted (I did google this error, but not much came up on it).
At this point I tried mounting it again (-o usebackuproot,ro) and it
actually worked (although dmesg reported transid errors still), and I
was able to read the last few files I'd written to it before I'd
become ill.  I also tried a scrub to see if that would fix anything,
but that just bombed out.  So I then tried mounting with just
usebackuproot, which also worked (again with transid errors), and tried
a scrub again in case it had bombed out previously due to the ro
option, but alas it didn't work.  Finally, I tried mounting it without
any options and again it worked (again with transid errors), but again
I couldn't get it to scrub.  As I am using Ubuntu 17.10 (so btrfs-progs
4.12), I compiled a static build of the git version of btrfs-progs
4.15 to see if the check and scrub features in a newer build might be
able to fix things, but alas I just get the same errors.
During all of this, I've not written anything to the drives, apart
from whatever the scrubs and checks may have done (i.e. I've not
written any more files to the drives), as there is still something very
wrong, even though I seem to be able to read the files okay (obviously
I've not tried them all, just the most recent and a couple of very old
ones).  I've cut and pasted the current output of dmesg and/or
journalctl when mounting/scrubbing/checking the drives.  As you can see
there are 4 drives with a large number of read/write/corrupt errors
listed, which I assume to be the old errors caused by the bad cable
(they don't seem to be increasing).  Also some of the transid
generation numbers are close-ish and some miles off.  I assume this is
due to corrupted generation numbers being written, as I'd expect them
to be a similar amount out, not such a wide range as I have.
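In case it helps anyone compare the numbers, here is a minimal sketch
(plain Python, nothing to do with btrfs-progs itself) that pulls the
wanted/found pairs out of a saved dmesg or check log and shows how far
apart they are.  The names TRANSID_RE and transid_gaps are made up for
illustration, and it assumes the message format seen in the output
further down, including lines wrapped by the mail client.

```python
import re

# Assumed message format, taken from the pasted logs; \s+ lets the
# pattern match across a line that was wrapped after "verify".
TRANSID_RE = re.compile(
    r"parent transid verify\s+failed on (\d+) wanted (\d+) found (\d+)"
)

def transid_gaps(log_text):
    """Return {block address: (wanted, found, found - wanted)}.

    Repeated messages for the same block collapse into one entry.
    """
    gaps = {}
    for m in TRANSID_RE.finditer(log_text):
        bytenr, wanted, found = (int(g) for g in m.groups())
        gaps[bytenr] = (wanted, found, found - wanted)
    return gaps

# A few lines copied from the scrub dmesg output in this mail.
sample = """\
[1216802.322219] BTRFS error (device sdf): parent transid verify
failed on 50124160401408 wanted 65598 found 1522534
[1216804.677302] BTRFS error (device sdf): parent transid verify
failed on 50124133826560 wanted 287542 found 1213960
[1216810.010117] BTRFS error (device sdf): parent transid verify
failed on 50124126388224 wanted 78735 found 1125448
"""

for bytenr, (wanted, found, gap) in sorted(transid_gaps(sample).items()):
    print(f"{bytenr}: wanted {wanted}, found {found}, gap {gap}")
```

This only reads a text log and never touches the filesystem, so it
should be safe to run against a copy of the dmesg output.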
Basically, as the data seems readable I'd love to run a command that
will just tell it to change the generation numbers to what they should
be and clear the error counts, but I'm at a loss to find such a
command (is this what zero-log does?  I'm a bit hesitant to use it
until I hear from someone in the know).  Also, as check finished early
due to the permissions error with root items, and scrub doesn't seem to
be able to handle the transid errors without finishing early, I need a
way to get around these too.
Anyway, I hope this all makes sense, and sorry for it being so long,
but thought it best to give as much detail as possible.  Thank you for
your help.
Alex

Output:
uname -a:
Linux TheMatrix 4.13.0-32-generic #35-Ubuntu SMP Thu Jan 25 09:13:46
UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

btrfs --version:
btrfs-progs v4.12
btrfs.static --version:
btrfs-progs v4.15

btrfs filesystem show /mnt/btrfs/broken:
Label: 'Videos'  uuid: db40f1f9-6239-4db0-87d9-bbce4affdb43
        Total devices 11 FS bytes used 38.18TiB
        devid    1 size 3.64TiB used 3.45TiB path /dev/sds
        devid    2 size 3.64TiB used 3.45TiB path /dev/sdq
        devid    3 size 3.64TiB used 3.45TiB path /dev/sdi
        devid    4 size 3.64TiB used 3.45TiB path /dev/sdj
        devid    5 size 3.64TiB used 3.45TiB path /dev/sdd
        devid    6 size 3.64TiB used 3.56TiB path /dev/sde
        devid    7 size 3.64TiB used 3.56TiB path /dev/sdh
        devid    8 size 3.64TiB used 3.56TiB path /dev/sdr
        devid    9 size 3.64TiB used 3.56TiB path /dev/sdk
        devid   10 size 3.64TiB used 3.56TiB path /dev/sdf
        devid   11 size 3.64TiB used 3.56TiB path /dev/sdp

btrfs filesystem df /mnt/btrfs/broken:
Data, RAID0: total=38.54TiB, used=38.14TiB
System, RAID1: total=64.00MiB, used=3.30MiB
Metadata, RAID1: total=45.61GiB, used=42.01GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

During mount /dev/sdd /mnt/btrfs/broken:
Nothing via stdout
dmesg:
[1216197.622816] BTRFS info (device sdf): disk space caching is enabled
[1216197.622817] BTRFS info (device sdf): has skinny extents
[1216197.845195] BTRFS info (device sdf): bdev /dev/sdf errs: wr 487,
rd 430, flush 0, corrupt 0, gen 8
[1216197.845198] BTRFS info (device sdf): bdev /dev/sdd errs: wr
5292890, rd 693156626, flush 0, corrupt $
, gen 10
[1216197.845201] BTRFS info (device sdf): bdev /dev/sdp errs: wr 2316,
rd 678, flush 0, corrupt 3072, gen
0
[1216197.845202] BTRFS info (device sdf): bdev /dev/sdi errs: wr 335,
rd 615, flush 0, corrupt 0, gen 0
[1216197.845205] BTRFS info (device sdf): bdev /dev/sds errs: wr
98077, rd 114295, flush 0, corrupt 0, ge$
 144
[1216197.845207] BTRFS info (device sdf): bdev /dev/sdr errs: wr 456,
rd 400, flush 0, corrupt 0, gen 0
[1216210.463895] BTRFS info (device sdf): checking UUID tree
[1216211.577078] verify_parent_transid: 2 callbacks suppressed
[1216211.577080] BTRFS error (device sdf): parent transid verify
failed on 50124166119424 wanted 347306 found 1589231
[1216211.594138] BTRFS error (device sdf): parent transid verify
failed on 50124166119424 wanted 347306 found 1589231
[1216211.595063] BTRFS warning (device sdf): btrfs_uuid_scan_kthread failed -5
[1216242.108296] BTRFS warning (device sdf): block group
50124114100224 has wrong amount of free space
[1216242.108298] BTRFS warning (device sdf): failed to load free space
cache for block group 50124114100224, rebuilding it now
[1216246.641588] BTRFS info (device sdf): The free space cache file
(49531408613376) is invalid. skip it
[1216247.324292] BTRFS info (device sdf): The free space cache file
(49821318905856) is invalid. skip it
[1216247.621971] BTRFS info (device sdf): The free space cache file
(49866416062464) is invalid. skip it
[1216247.850904] BTRFS info (device sdf): The free space cache file
(49924398120960) is invalid. skip it
[1216248.907827] BTRFS info (device sdf): The free space cache file
(50279806664704) is invalid. skip it
[1216254.083562] BTRFS info (device sdf): The free space cache file
(52395078057984) is invalid. skip it
[1216260.278069] BTRFS info (device sdf): The free space cache file
(54948436115456) is invalid. skip it
[1216266.747547] BTRFS info (device sdf): The free space cache file
(57407304892416) is invalid. skip it
[1216267.730967] BTRFS info (device sdf): The free space cache file
(57711173828608) is invalid. skip it

During btrfs check --repair --progress /dev/sdd:
stdout:
enabling repair mode
warning, bad space info total_bytes 2147483648 used 2169978880
Checking filesystem on /dev/sdd
UUID: db40f1f9-6239-4db0-87d9-bbce4affdb43
parent transid verify failed on 50124166119424 wanted 347306 found 1589231
parent transid verify failed on 50124166119424 wanted 347306 found 1589231
parent transid verify failed on 50124166119424 wanted 347306 found 1589231
parent transid verify failed on 50124166119424 wanted 347306 found 1589231
Ignoring transid failure
parent transid verify failed on 50124166119424 wanted 347306 found 1589231
Ignoring transid failure
leaf parent key incorrect 50124166119424
bad block 50124166119424

ERROR: errors found in extent allocation tree or chunk allocation
parent transid verify failed on 50124166119424 wanted 347306 found 1589231
Ignoring transid failure
leaf parent key incorrect 50124166119424
ERROR: failed to repair root items: Operation not permitted

No dmesg output

During btrfs scrub start -Bd /mnt/btrfs/broken:
stdout:
ERROR: scrubbing /mnt/btrfs1/broken failed for device id 1: ret=-1,
errno=5 (Input/output error)
ERROR: scrubbing /mnt/btrfs1/broken failed for device id 2: ret=-1,
errno=5 (Input/output error)
ERROR: scrubbing /mnt/btrfs1/broken failed for device id 3: ret=-1,
errno=5 (Input/output error)
ERROR: scrubbing /mnt/btrfs1/broken failed for device id 4: ret=-1,
errno=5 (Input/output error)
ERROR: scrubbing /mnt/btrfs1/broken failed for device id 5: ret=-1,
errno=5 (Input/output error)
ERROR: scrubbing /mnt/btrfs1/broken failed for device id 6: ret=-1,
errno=5 (Input/output error)
ERROR: scrubbing /mnt/btrfs1/broken failed for device id 7: ret=-1,
errno=5 (Input/output error)
ERROR: scrubbing /mnt/btrfs1/broken failed for device id 8: ret=-1,
errno=5 (Input/output error)
ERROR: scrubbing /mnt/btrfs1/broken failed for device id 9: ret=-1,
errno=5 (Input/output error)
ERROR: scrubbing /mnt/btrfs1/broken failed for device id 10: ret=-1,
errno=5 (Input/output error)
ERROR: scrubbing /mnt/btrfs1/broken failed for device id 11: ret=-1,
errno=5 (Input/output error)
scrub device /dev/sds (id 1) canceled
        scrub started at Thu Feb 15 19:10:31 2018 and was aborted after 00:01:12
        total bytes scrubbed: 7.44GiB with 0 errors
scrub device /dev/sdq (id 2) canceled
        scrub started at Thu Feb 15 19:10:31 2018 and was aborted after 00:01:12
        total bytes scrubbed: 7.58GiB with 0 errors
scrub device /dev/sdi (id 3) canceled
        scrub started at Thu Feb 15 19:10:31 2018 and was aborted after 00:01:29
        total bytes scrubbed: 9.51GiB with 0 errors
scrub device /dev/sdj (id 4) canceled
        scrub started at Thu Feb 15 19:10:31 2018 and was aborted after 00:02:05
        total bytes scrubbed: 12.23GiB with 0 errors
scrub device /dev/sdd (id 5) canceled
        scrub started at Thu Feb 15 19:10:31 2018 and was aborted after 00:06:08
        total bytes scrubbed: 47.41GiB with 0 errors
scrub device /dev/sde (id 6) canceled
        scrub started at Thu Feb 15 19:10:31 2018 and was aborted after 00:02:49
        total bytes scrubbed: 18.83GiB with 0 errors
scrub device /dev/sdh (id 7) canceled
        scrub started at Thu Feb 15 19:10:31 2018 and was aborted after 00:02:09
        total bytes scrubbed: 13.52GiB with 0 errors
scrub device /dev/sdr (id 8) canceled
        scrub started at Thu Feb 15 19:10:31 2018 and was aborted after 00:00:54
        total bytes scrubbed: 5.36GiB with 0 errors
scrub device /dev/sdk (id 9) canceled
        scrub started at Thu Feb 15 19:10:31 2018 and was aborted after 00:01:00
        total bytes scrubbed: 6.32GiB with 0 errors
scrub device /dev/sdf (id 10) canceled
        scrub started at Thu Feb 15 19:10:31 2018 and was aborted after 00:00:52
        total bytes scrubbed: 5.98GiB with 0 errors
scrub device /dev/sdp (id 11) canceled
        scrub started at Thu Feb 15 19:10:31 2018 and was aborted after 00:00:52
        total bytes scrubbed: 5.98GiB with 0 errors

dmesg:
[1216802.322219] BTRFS error (device sdf): parent transid verify
failed on 50124160401408 wanted 65598 found 1522534
[1216802.356205] BTRFS error (device sdf): parent transid verify
failed on 50124160401408 wanted 65598 found 1522534
[1216802.400389] BTRFS error (device sdf): parent transid verify
failed on 50124160401408 wanted 65598 found 1522534
[1216802.501807] BTRFS error (device sdf): parent transid verify
failed on 50124160401408 wanted 65598 found 1522534
[1216802.503638] BTRFS error (device sdf): parent transid verify
failed on 50124160401408 wanted 65598 found 1522534
[1216804.677302] BTRFS error (device sdf): parent transid verify
failed on 50124133826560 wanted 287542 found 1213960
[1216804.692425] BTRFS error (device sdf): parent transid verify
failed on 50124133826560 wanted 287542 found 1213960
[1216810.010117] BTRFS error (device sdf): parent transid verify
failed on 50124126388224 wanted 78735 found 1125448
[1216810.029168] BTRFS error (device sdf): parent transid verify
failed on 50124126388224 wanted 78735 found 1125448
[1216810.045383] BTRFS error (device sdf): parent transid verify
failed on 50124126388224 wanted 78735 found 1125448
[1216822.251380] BTRFS error (device sdf): parent transid verify
failed on 50124153438208 wanted 271274 found 2545637
[1216822.263511] BTRFS error (device sdf): parent transid verify
failed on 50124153438208 wanted 271274 found 2545637
[1216822.325286] BTRFS error (device sdf): parent transid verify
failed on 50124152291328 wanted 271274 found 2297672
[1216822.332262] BTRFS error (device sdf): parent transid verify
failed on 50124152291328 wanted 271274 found 2297672
[1216839.859689] BTRFS error (device sdf): parent transid verify
failed on 50124152291328 wanted 271274 found 2297672
[1216839.873995] BTRFS error (device sdf): parent transid verify
failed on 50124152291328 wanted 271274 found 2297672
[1216875.849979] BTRFS error (device sdf): parent transid verify
failed on 50124152291328 wanted 271274 found 2297672
[1216875.857510] BTRFS error (device sdf): parent transid verify
failed on 50124152291328 wanted 271274 found 2297672
[1216879.640796] BTRFS error (device sdf): parent transid verify
failed on 50124152799232 wanted 208777 found 2338800
[1216879.644477] BTRFS error (device sdf): parent transid verify
failed on 50124152799232 wanted 208777 found 2338800
[1216919.772617] BTRFS error (device sdf): parent transid verify
failed on 50124145115136 wanted 279114 found 1347923
[1216919.794803] BTRFS error (device sdf): parent transid verify
failed on 50124145115136 wanted 279114 found 1347923
[1217118.374837] BTRFS error (device sdf): parent transid verify
failed on 50124148342784 wanted 270180 found 1383517
[1217118.397785] BTRFS error (device sdf): parent transid verify
failed on 50124148342784 wanted 270180 found 1383517