On 2019-10-21 09:02, Christian Pernegger wrote:
[Please CC me, I'm not on the list.]

Am Mo., 21. Okt. 2019 um 13:47 Uhr schrieb Austin S. Hemmelgarn
<ahferro...@gmail.com>:
I've [worked with fs clones] like this dozens of times on single-device volumes 
with exactly zero issues.

Thank you, I have taken precautions, but it does seem to work fine.

There are actually two possible ways I can think of a buggy GPU driver causing 
this type of issue: [snip]

Interesting and plausible, but ...

Your best option for mitigation [...] is to ensure that your hardware has an 
IOMMU [...] and ensure it's enabled in firmware.

It has and it is. (The machine's been specced so GPU pass-through is
an option, should it be required. I haven't gotten around to setting
that up yet, haven't even gotten a second GPU, but I have laid the
groundwork, the IOMMU is enabled and, as far as one can tell from logs
and such, working.)

However, there's also the possibility that you may have hardware issues.

Don't I know it ... The problem is, if there are hardware issues,
that's the first I've seen of them, and while I didn't run torture
tests, there was quite a lot of benchmarking when it was new. Needle
in a haystack. Some memory testing can't hurt, I suppose. Any other
ideas (for hardware testing)?
The power supply would be the other big one I'd suggest testing, as a bad PSU can cause all kinds of odd intermittent issues. Just like with RAM, you can't really easily cover everything, but you can check some things that have very low false negative rates when indicating problems.

Typical procedure I use is:

1. Completely disconnect the PSU from _everything_ inside the computer. (If you're really paranoid, you can remove the PSU from the case entirely too, though that won't make the testing any more reliable or safer.)

2. Make sure the PSU itself is plugged in to mains power, with the switch on the back (if it has one) turned on.

3. Connect a good multimeter to the 24-pin main power connector, with the positive probe on pin 8 and the negative probe on pin 7, set to measure DC voltages in the double-digit range with the highest precision possible.

4. Short pins 15 and 16 of the 24-pin main power connector using a short piece of solid copper wire. At this point, if the PSU has a fan, the fan should turn on, and the multimeter should read +5 volts within half a second or less.

5. Check the voltage of each power rail relative to ground. Make sure to watch each one for a couple of seconds for any fluctuations, and make a point of checking _each_ set of wires coming off of the PSU separately (as well as each wire in each connector independently, even if they're supposed to be tied together internally).

6. Check the +5V standby rail by hooking the multimeter up to it and a ground pin, then disconnecting the copper wire from step 4. It should maintain its voltage while you're disconnecting the wire and afterwards, even once the fan stops.

You can find the respective pinouts online in many places (for example, [1]). Tolerances are +/- 5% on everything except the negative voltages, which are +/- 10%. The -5V pin may show nothing, which is normal (modern systems do not use -5V for anything, and most don't actually use -12V anymore either, though it's still provided). Passing these tests won't prove the PSU is good (it could still misbehave under load), but if any of them fails, you can be 100% certain you have either a bad PSU or suspect mains power (usually the issue there is very high line noise, though you'll need special equipment to test for that).
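For reference, the acceptance windows those tolerances work out to are easy to compute. A rough sketch (the rail list and tolerances are just the ones described above, nothing more authoritative):

```python
# Rough sketch: acceptable voltage windows for common ATX rails, using
# the tolerances mentioned above (+/-5%, or +/-10% for negative rails).
# Purely illustrative arithmetic, not a quote from the ATX spec.
RAILS = {
    "+3.3V": (3.3, 0.05),
    "+5V":   (5.0, 0.05),
    "+12V":  (12.0, 0.05),
    "-12V":  (-12.0, 0.10),
    "+5VSB": (5.0, 0.05),
}

def window(nominal, tol):
    """Return the (low, high) bounds for a rail, ordered correctly
    even when the nominal voltage is negative."""
    a, b = nominal * (1 - tol), nominal * (1 + tol)
    return (min(a, b), max(a, b))

for name, (nominal, tol) in RAILS.items():
    lo, hi = window(nominal, tol)
    print(f"{name:>6}: {lo:+.2f} V to {hi:+.2f} V")
```

So, for instance, anything outside roughly 4.75-5.25 V on a +5V wire means the test has failed.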

Back on the topic of TRIM: I'm 99 % certain discard wasn't set on the
mount (not by me, in any case), but I think Mint runs fstrim
periodically by default. Just to be sure, should any form of TRIM be
disabled?
The issue with TRIM is that it drops old copies of the on-disk data structures used by BTRFS, which can make recovery more difficult in the event of a crash. Running `fstrim` at regular intervals is not as much of an issue as inline discard, but still drops the old trees, so there's a window of time right after it gets run when you are more vulnerable.

Additionally, some SSDs have had issues with TRIM causing data corruption elsewhere on the disk, but it's been years since I've seen such a report, and I don't think a Samsung device as recent as yours is likely to have that problem.
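If you want to confirm whether inline discard is actually set on a mount, the options field of /proc/self/mounts is the thing to check. A small sketch (the function and the sample input are made up for illustration):

```python
# Sketch: check whether a given mount point has the inline "discard"
# option set, by parsing /proc/self/mounts-style text. The sample
# lines below are fabricated for illustration.
def has_discard(mounts_text, mountpoint):
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, fs type, comma-separated options, ...
        if len(fields) >= 4 and fields[1] == mountpoint:
            return "discard" in fields[3].split(",")
    return False  # mount point not found

sample = "\n".join([
    "/dev/sda2 / btrfs rw,relatime,ssd,space_cache,subvol=/@ 0 0",
    "/dev/sdb1 /data btrfs rw,relatime,discard,ssd,subvol=/ 0 0",
])
print(has_discard(sample, "/"))      # inline discard not set here
print(has_discard(sample, "/data"))  # inline discard set here
```

On a live system you'd pass in `open("/proc/self/mounts").read()` instead of the sample text. Note this only catches the mount option; a periodic `fstrim` run from a systemd timer or cron job won't show up here.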

The only other idea I've got is Timeshift's hourly snapshots. (How)
would btrfs deal with a crash during snapshot creation?
It should have no issues whatsoever most of the time. The only case I can think of where it might is if you're snapshotting a subvolume that's being written to at the same time. Snapshots on BTRFS are only truly atomic if none of the data being snapshotted is being written to at the same time. If there are pending writes, there are some indeterminate states involved, and crashing then might produce a corrupted snapshot, but shouldn't cause any other issues.


In other news, I've still not quite given up, mainly because the fs
doesn't look all that broken. The output of btrfs inspect-internal
dump-tree (incl. options), for instance, looks like gibberish to me of
course, but it looks sane, doesn't spew warnings, doesn't error out or
crash. Also plain btrfs check --init-extent-tree errored out, same
with -s0, but with -s1 it's now chugging along. (BTW, is there a
hierarchy among the super block slots, a best or newest one?)
AIUI, when they get updated, they get written out in the order they occur on disk, but other than that they're supposed to always be in sync. So if you hit an issue while the first one is being written out, you can often recover by using the second or a later one.
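For reference, the `-s0`/`-s1`/`-s2` numbers in `btrfs check` correspond to the on-disk superblock copies. A small sketch of where those copies live, based on the well-known btrfs layout (primary at 64 KiB, mirrors at 64 MiB and 256 GiB); double-check against your btrfs-progs documentation before relying on it:

```python
# Sketch of where the btrfs superblock copies live on disk, matching
# the -s0/-s1/-s2 numbering used by `btrfs check`. Offsets follow the
# commonly documented layout; verify against your btrfs-progs version.
def sb_offset(mirror):
    if mirror == 0:
        return 64 * 1024                  # primary superblock at 64 KiB
    return (16 * 1024) << (12 * mirror)   # mirrors at 64 MiB and 256 GiB

for m in range(3):
    print(f"-s{m}: byte offset {sb_offset(m)}")
```

This is also why small devices have fewer copies: a device under 256 GiB simply has no room for the second mirror.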

Will keep you posted.

Cheers,
C.

