Rich Freeman posted on Thu, 22 Sep 2016 07:18:35 -0500 as excerpted:
> I have been getting panics consistently after doing a btrfs replace
> operation on a raid1 and rebooting. I linked a photo of the panic; I
> haven't been able to get a text capture of it.
> I'm getting this error on the latest 4.4, 4.1, and even on an old
> 3.18.26 kernel I had lying around.
> I tried the remove root_log_ctx from ctx list before btrfs_sync_log
> returns patch on 4.1 and that did not solve my problem either.
> I'm able to boot into single-user mode and if I don't start any
> processes the system seems fairly stable. I am also able to start a
> btrfs balance and run that for several hours without issue. If I start
> launching services the system will tend to panic, though how many
> processes I can launch will vary. I don't think that it is a particular
> file being accessed that is triggering the issue since the point where
> it fails varies. I suspect it may be load-related.
> Mounting with compress=no doesn't seem to help either. Granted, I see
> lzo_decompress in the backtrace and that is probably a read operation.
> Any suggestions? Google hasn't been helpful on this one...
Btrfs raid1 you say, and you have existing compressed files it's trying
to read in the backtrace?
Sounds like the issues I see sometimes and have posted about where after
a crash that resulted in one device of my raid1 pair getting behind the
other, the kernel will crash if it sees too many csum-errors, even tho
it's /supposed/ to check the other copy and read from it if valid (which
it is as a btrfs scrub resolves the issue).
When booted to rescue/single-user mode, can you run a scrub? If it's the
csum-related problem I see and the replace worked, a scrub should
complete fine, repairing the bad copy from the mirror, and the problem
should be resolved. If the replace bugged out and you now have only one
copy of some chunks, if scrub finds an error there it obviously won't be
able to repair from the good mirror, but it should at least spot some csum
errors it can't repair.
If a scrub crashes too, if it completes without finding any errors to
correct, or if it finds and corrects errors but the issue persists, then
it's unlikely to be the issue I've seen.
FWIW, the issue I've seen appears to be related to attempts to read
compressed files. It does not appear to affect users who don't have any
such files or do but they're simply not accessed in ordinary operations.
It may or may not affect other than raid1 and likely raid10, but they
make it easiest to verify due to the possibility of one copy getting out
of sync with the other, and due to scrub's ability to confirm that as the
problem as it can repair the bad copy from the good one, which the kernel
should do dynamically as well, but that's where the bug is as too many
dynamic csum errors trigger a crash even when there's a second copy
available, that scrub later verifies as valid.
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html