I'm running my normal workstation with git kernels from 
git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-testing.git
and just got the second file system corruption in three weeks. I do
not have issues with stable kernels, and just want to give you a
heads up that there might be something seriously broken in current
development kernels.

The first corruption was with a kernel based on 4.18.0-rc1
(wt-2018-06-20) and the second one today based on 4.18.0-rc4
(wt-2018-07-09).
The first corruption definitely destroyed data; the second one has
not been examined at all yet.

After the reinstall I ran several scrubs; the last successful one was
a week ago.

Of course this could be unrelated to the development kernels or even
btrfs, but two corruptions within weeks after years without problems
is very suspect.
And since btrfs also allowed corrupted data to be read (with a stable
Ubuntu kernel, see below for more details) it looks like this is
indeed an issue in btrfs, correct?

A btrfs subvolume is used as the rootfs on a "Samsung SSD 850 EVO
mSATA 1TB" and I'm running Gentoo ~amd64 on a Thinkpad W530. Discard
is enabled as mount option and there were roughly 5 other
subvolumes.

I'm currently backing up the full btrfs partition after the second
corruption which announced itself with the following log entries:

[  979.223767] BTRFS critical (device sdc2): corrupt leaf: root=2
block=1029783552 slot=1, unexpected item end, have 16161 expect
16250

    This means that the metadata block matches the checksum in its
header, but is internally inconsistent. This means that the error in
the block was made before the csum was computed -- i.e., it was that
way in RAM. This can happen in a couple of different ways, but the
most likely cause is bad RAM.
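
To illustrate the point above, here is a minimal sketch of why a checksum cannot catch damage that happened before it was computed. It is not btrfs code: the block layout, the helper names write_block and verify_block, and the payload are made up for illustration, and zlib.crc32 stands in for the crc32c that btrfs uses by default.

```python
import zlib

# Hypothetical toy "metadata block": 4-byte checksum header + payload.
# zlib.crc32 is a stand-in for crc32c; the layout is invented for this sketch.

def write_block(payload: bytes) -> bytes:
    """Compute the checksum over the payload as it currently sits in RAM."""
    csum = zlib.crc32(payload).to_bytes(4, "little")
    return csum + payload

def verify_block(block: bytes) -> bool:
    """Recompute the checksum on read and compare it with the header."""
    stored = int.from_bytes(block[:4], "little")
    return stored == zlib.crc32(block[4:])

good = bytearray(b"item-end=16250" + bytes(50))

# Case 1: the payload is damaged in RAM *before* the checksum is computed.
corrupted_in_ram = bytearray(good)
corrupted_in_ram[9:14] = b"16161"
block_a = write_block(bytes(corrupted_in_ram))
csum_ok_but_wrong = verify_block(block_a)   # checksum matches bad content

# Case 2: the block is damaged *after* the checksum is computed.
block_b = bytearray(write_block(bytes(good)))
block_b[10] ^= 0x01                         # flip one bit in the payload
csum_catches_it = not verify_block(bytes(block_b))

print(csum_ok_but_wrong, csum_catches_it)
```

In the first case the checksum verifies even though the content is wrong, which is the situation the "corrupt leaf" check reports; in the second case the checksum mismatch would have flagged the block instead.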

    In this case, it's not a single bitflip in the metadata page
itself, so it's more likely to be something writing spurious data on
the page in RAM that was holding this metadata block. This is either a
bug in the kernel, or a hardware problem.
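
As a rough sanity check of the "not a single bitflip" reasoning, the two values from the log line can be compared bit by bit. This sketch assumes the "have" and "expect" numbers are directly comparable integers; the field that was actually overwritten in the leaf may be a different one, so treat this as an illustration only.

```python
# Values taken from the "corrupt leaf" log line above.
have, expect = 16161, 16250

def is_single_bitflip(a: int, b: int) -> bool:
    """True if a and b differ in exactly one bit position."""
    diff = a ^ b
    return diff != 0 and (diff & (diff - 1)) == 0

diff = have ^ expect
flipped_bits = bin(diff).count("1")

print(f"{have:#06x} ^ {expect:#06x} = {diff:#04x}, {flipped_bits} bits differ")
print("single bitflip:", is_single_bitflip(have, expect))
```

The two values differ in several bit positions, which fits a stray write over the page better than one flipped bit.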

    I would strongly recommend checking your RAM (memtest86 for a
minimum of 8 hours, preferably 24).

The system has 24G of RAM, but since the reinstall involved compiling the complete OS from scratch (with a stable kernel), I would have expected to hit the bad RAM there as well, so I more or less dismissed that possibility. I'll run the tests and report back on that too.

[  979.223808] BTRFS: error (device sdc2) in __btrfs_cow_block:1080:
errno=-5 IO failure
[  979.223810] BTRFS info (device sdc2): forced readonly
[  979.224599] BTRFS warning (device sdc2): Skipping commit of
aborted transaction.
[  979.224603] BTRFS: error (device sdc2) in
cleanup_transaction:1847: errno=-5 IO failure

I'll restore the system from a backup - and stick to stable kernels
for now - after that, but if needed I can of course also restore the
partition backup to another disk for testing.

    It may be a kernel issue, but it's not necessarily in btrfs. It
could be a bug in some other kernel component where it does some
pointer arithmetic wrong, or uses some uninitialised data as a
pointer. My money's on bad RAM, though (by a small margin).


I also had two out-of-tree kernel modules loaded:
https://github.com/hhfeuer/nvhda and the Gentoo-packaged version of https://github.com/mkottman/acpi_call

Alexander