On 10/10/2018 2:20 PM, Holger Hoffstätte wrote:
> On 10/10/18 19:25, Larkin Lowrey wrote:
>> On 10/10/2018 12:04 PM, Holger Hoffstätte wrote:
>>> On 10/10/18 17:44, Larkin Lowrey wrote:
>>>> (..)
>>>> About once a week, or so, I'm running into the above situation where
>>>> the FS seems to deadlock. All IO to the FS blocks, there is no IO
>>>> activity at all. I have to hard reboot the system to recover. There
>>>> are no error indications except for the following which occurs well
>>>> before the FS freezes up:
>>>> BTRFS warning (device dm-3): block group 78691883286528 has wrong
>>>> amount of free space
>>>> BTRFS warning (device dm-3): failed to load free space cache for
>>>> block group 78691883286528, rebuilding it now
>>>> Do I have any options other than nuking the FS and starting over?
>>> Unmount cleanly & mount again with -o space_cache=v2.
>> It froze while unmounting. The attached zip is a stack dump captured
>> via 'echo t > /proc/sysrq-trigger'. A second attempt after a hard
>> reboot worked.
> Trace says freespace cache writeout failed midway while the scsi device
> was resetting itself and then went aaaarrrghh. Probably managed to hit
> different blocks on the second attempt. So chances are your controller,
> disk or something else is broken, dying, or both.
> When things have settled and you have verified that r/o mounting works
> and is stable, try rescuing the data (when necessary) before scrubbing,
> dm-device-checking or whatever you have set up.

Interesting, because I do not see any indications of any other errors.
The fs is backed by an mdraid array and the raid checks always pass with
no mismatches, edac-util doesn't report any ECC errors, smartd doesn't
report any SMART errors, and I never see any raid controller errors. I
have the console connected through serial to a logging console server so
if there were errors reported I would have seen them.
--Larkin