Rich Freeman posted on Thu, 22 Sep 2016 07:18:35 -0500 as excerpted: > I have been getting panics consistently after doing a btrfs replace > operation on a raid1 and rebooting. I linked a photo of the panic; I > haven't been able to get a text capture of it. > > https://ibin.co/2vx0HhDeViu3.jpg > > I'm getting this error on the latest 4.4, 4.1, and even on an old > 3.18.26 kernel I had lying around. > > I tried the remove root_log_ctx from ctx list before btrfs_sync_log > returns patch on 4.1 and that did not solve my problem either. > > I'm able to boot into single-user mode and if I don't start any > processes the system seems fairly stable. I am also able to start a > btrfs balance and run that for several hours without issue. If I start > launching services the system will tend to panic, though how many > processes I can launch will vary. I don't think that it is a particular > file being accessed that is triggering the issue since the point where > it fails varies. I suspect it may be load-related. > > Mounting with compress=no doesn't seem to help either. Granted, I see > lzo_decompress in the backtrace and that is probably a read operation. > > Any suggestions? Google hasn't been helpful on this one...
Btrfs raid1 you say, and you have existing compressed files it's trying to read in the backtrace? Sounds like the issues I see sometimes and have posted about where after a crash that resulted in one device of my raid1 pair getting behind the other, the kernel will crash if it sees too many csum-errors, even tho it's /supposed/ to check the other copy and read from it if valid (which it is as a btrfs scrub resolves the issue). When booted to rescue/single-user mode, can you run a scrub? If it's the csum-related problem I see and the replace worked, a scrub should complete fine, repairing the bad copy from the mirror, and the problem should be resolved. If the replace bugged out and you now have only one copy of some chunks, if scrub finds an error there it obviously won't be able to repair from the good mirror, but it should at least spot some csum errors it can't repair. If a scrub crashes too, if it completes without finding any errors to correct, or if it finds and corrects errors but the issue persists, then it's unlikely to be the issue I've seen. FWIW, the issue I've seen appears to be related to attempts to read compressed files. It does not appear to affect users who don't have any such files or do but they're simply not accessed in ordinary operations. It may or may not affect other than raid1 and likely raid10, but they make it easiest to verify due to the possibility of one copy getting out of sync with the other, and due to scrub's ability to confirm that as the problem as it can repair the bad copy from the good one, which the kernel should do dynamically as well, but that's where the bug is as too many dynamic csum errors trigger a crash even when there's a second copy available, that scrub later verifies as valid. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html