On 2019/1/16 9:38 AM, Chris Murphy wrote:
> On Tue, Jan 15, 2019 at 5:04 AM David Sterba <[email protected]> wrote:
>>
>> On Tue, Jan 15, 2019 at 07:48:47PM +0800, Qu Wenruo wrote:
>>> Super nice move, it shows the corruption and the cause.
>>>
>>> item 66 key (1714119835648 METADATA_ITEM 0) itemoff 13325 itemsize 33
>>> item 67 key (10510212874240 METADATA_ITEM 0) itemoff 13283 itemsize 42
>>> item 68 key (1714119868416 METADATA_ITEM 0) itemoff 13250 itemsize 33
>>
>> The key order is the most frequent and also very reliable report of the
>> memory bitflips. I think we should add an unconditional check before a
>> leaf or node is written so we catch such errors before the bad data hit
>> the disk.
>>
>> This seems to happen way too often, I believe the check overhead would
>> be acceptable and at least give early warning.
>
> What about out of tree or proprietary modules tainting the kernel?
For XPS13 there is no dedicated GPU on board, so no NVidia bullsh*t.
And I don't really think it's proprietary modules.

> Or
> other corruptions we see that aren't key order related, like the
> several recent "unable to find ref byte" reports?

I'm not super clear on extent tree corruption, but I really don't think
they are the same bug.

> Are these memory
> corruption related, or are they non-Btrfs bugs causing such
> corruption? Does it make any sense for users who are running
> proprietary or out of tree kernels to run with slub_debug=F or even
> FZP and possibly get a better idea what category the corruption is in?

Anyway, I'm working on the idea David mentioned.
Hopefully we will soon get earlier detection and some clues.

>
> I guess what I'm getting at is, users get a corrupt file system, they
> can't repair it (honestly the tools are not good enough, and aren't
> user friendly),

Definitely.

> so we tell them OK just start over with a new file
> system. It would be better if there's some additional advice to give
> them to try and find out what caused the corruption to begin with,
> rather than just start over and maybe run into the same problem again.

Obviously, the current tree checker is already too late for such cases.
But if we catch them just before writing to disk, then it'll be much
better. Users won't get a corrupted fs, and we will get a clue, then
everyone is happy.

Thanks,
Qu
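
The quoted leaf dump can be checked by hand. Below is a minimal sketch in Python (not kernel code) of the two steps involved: compare the three keys as (objectid, type, offset) tuples, then look for a single flipped bit that would put item 67 back between its neighbours. The helper names first_misordered and single_bit_candidates are made up for illustration, and the sketch assumes the flip hit the objectid rather than the type or offset.

```python
# Illustrative sketch only; helper names are hypothetical, not btrfs APIs.
METADATA_ITEM = 169  # numeric value of BTRFS_METADATA_ITEM_KEY

# Keys from the quoted leaf dump, as (objectid, type, offset) tuples.
keys = [
    (1714119835648, METADATA_ITEM, 0),   # item 66
    (10510212874240, METADATA_ITEM, 0),  # item 67
    (1714119868416, METADATA_ITEM, 0),   # item 68
]


def first_misordered(keys):
    """Return the index of the first key that is not strictly greater
    than its predecessor, or None if the keys are in order."""
    for i in range(1, len(keys)):
        if keys[i] <= keys[i - 1]:
            return i
    return None


def single_bit_candidates(prev_key, suspect, next_key, bits=64):
    """List (bit, objectid) pairs where flipping one bit of the suspect's
    objectid makes it sort strictly between its two neighbours.
    Assumption: only the objectid field is tested."""
    objectid, key_type, offset = suspect
    found = []
    for bit in range(bits):
        fixed = (objectid ^ (1 << bit), key_type, offset)
        if prev_key < fixed < next_key:
            found.append((bit, fixed[0]))
    return found


bad = first_misordered(keys)
print("first misordered index:", bad)
print("candidates:", single_bit_candidates(keys[0], keys[1], keys[2]))
```

The ordering test flags index 2 (item 68), because item 68 sorts below item 67. The only single-bit candidate for item 67 is bit 43, giving objectid 1714119852032, which sits exactly 16384 after item 66 and 16384 before item 68. The ordering test is the part a write-time check would rely on; the bit-flip search is only a way to read the report.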
