On 2019-10-21 09:02, Christian Pernegger wrote:
[Please CC me, I'm not on the list.]

Am Mo., 21. Okt. 2019 um 13:47 Uhr schrieb Austin S. Hemmelgarn
<ahferro...@gmail.com>:
I've [worked with fs clones] like this dozens of times on single-device volumes 
with exactly zero issues.

Thank you, I have taken precautions, but it does seem to work fine.

There are actually two possible ways I can think of a buggy GPU driver causing 
this type of issue: [snip]

Interesting and plausible, but ...

Your best option for mitigation [...] is to ensure that your hardware has an 
IOMMU [...] and ensure it's enabled in firmware.

It has and it is. (The machine's been specced so GPU pass-through is
an option, should it be required. I haven't gotten around to setting
that up yet, haven't even gotten a second GPU, but I have laid the
groundwork, the IOMMU is enabled and, as far as one can tell from logs
and such, working.)

However, there's also the possibility that you may have hardware issues.

Don't I know it ... The problem is, if there are hardware issues,
that's the first I've seen of them, and while I didn't run torture
tests, there was quite a lot of benchmarking when it was new. Needle
in a haystack. Some memory testing can't hurt, I suppose. Any other
ideas (for hardware testing)?
The power supply would be the other big one I'd suggest testing, as a bad PSU can cause all kinds of odd intermittent issues. Just like with RAM, you can't really easily cover everything, but you can check some things that have very low false negative rates when indicating problems.

Typical procedure I use is:

1. Completely disconnect the PSU from _everything_ inside the computer. (If you're really paranoid, you can remove the PSU from the case entirely too, though that won't make the testing any more reliable or safer.)

2. Make sure the PSU itself is plugged in to mains power, with the switch on the back (if it has one) turned on.

3. Connect a good multimeter to the 24-pin main power connector, with the positive probe on pin 8 and the negative probe on pin 7, set to measure DC voltages in the double-digit range with the highest precision possible.

4. Short pins 15 and 16 of the 24-pin main power connector using a short piece of solid copper wire. At this point, if the PSU has a fan, the fan should turn on, and the multimeter should read +5 volts within half a second or less.

5. Check the voltage of each power rail relative to ground. Make sure to watch each one for a couple of seconds for any fluctuations, and make a point of checking _each_ set of wires coming off of the PSU separately (as well as each wire in each connector independently, even if they're supposed to be tied together internally).

6. Check the +5V standby rail by hooking the multimeter up to it and a ground pin, then disconnecting the copper wire from step 4. It should maintain its voltage while you're disconnecting the wire and afterwards, even once the fan stops.

You can find the respective pinouts online in many places (for example, [1]). Tolerances are +/- 5% on everything except the negative voltages, which are +/- 10%. The -5V pin may show nothing, which is normal (modern systems do not use -5V for anything, and most don't actually use -12V anymore either, though it's still provided). Passing these tests won't prove the PSU is good (it could still misbehave under load), but if any of them fails, you can be 100% certain you have either a bad PSU or suspect mains power (usually the issue there is very high line noise, though you'll need special equipment to test for that).
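For reference, the acceptance windows those tolerances work out to are easy to compute. A rough sketch (the rail list and tolerances are just the ones described above, nothing more authoritative):

```python
# Rough sketch: acceptable voltage windows for common ATX rails, using
# the tolerances mentioned above (+/-5%, or +/-10% for negative rails).
# Purely illustrative arithmetic, not a quote from the ATX spec.
RAILS = {
    "+3.3V": (3.3, 0.05),
    "+5V":   (5.0, 0.05),
    "+12V":  (12.0, 0.05),
    "-12V":  (-12.0, 0.10),
    "+5VSB": (5.0, 0.05),
}

def window(nominal, tol):
    """Return the (low, high) bounds for a rail, ordered correctly
    even when the nominal voltage is negative."""
    a, b = nominal * (1 - tol), nominal * (1 + tol)
    return (min(a, b), max(a, b))

for name, (nominal, tol) in RAILS.items():
    lo, hi = window(nominal, tol)
    print(f"{name:>6}: {lo:+.2f} V to {hi:+.2f} V")
```

So, for instance, anything outside roughly 4.75-5.25 V on a +5V wire means the test has failed.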

Back on the topic of TRIM: I'm 99 % certain discard wasn't set on the
mount (not by me, in any case), but I think Mint runs fstrim
periodically by default. Just to be sure, should any form of TRIM be
disabled?
The issue with TRIM is that it drops old copies of the on-disk data structures used by BTRFS, which can make recovery more difficult in the event of a crash. Running `fstrim` at regular intervals is not as much of an issue as inline discard, but still drops the old trees, so there's a window of time right after it gets run when you are more vulnerable.

Additionally, some SSDs have had issues with TRIM causing data corruption elsewhere on the disk, but it's been years since I've seen such a report, and I don't think a Samsung device as recent as yours is likely to have that problem.
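If you want to confirm whether inline discard is actually set on a mount, the options field of /proc/self/mounts is the thing to check. A small sketch (the function and the sample input are made up for illustration):

```python
# Sketch: check whether a given mount point has the inline "discard"
# option set, by parsing /proc/self/mounts-style text. The sample
# lines below are fabricated for illustration.
def has_discard(mounts_text, mountpoint):
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, fs type, comma-separated options, ...
        if len(fields) >= 4 and fields[1] == mountpoint:
            return "discard" in fields[3].split(",")
    return False  # mount point not found

sample = "\n".join([
    "/dev/sda2 / btrfs rw,relatime,ssd,space_cache,subvol=/@ 0 0",
    "/dev/sdb1 /data btrfs rw,relatime,discard,ssd,subvol=/ 0 0",
])
print(has_discard(sample, "/"))      # inline discard not set here
print(has_discard(sample, "/data"))  # inline discard set here
```

On a live system you'd pass in `open("/proc/self/mounts").read()` instead of the sample text. Note this only catches the mount option; a periodic `fstrim` run from a systemd timer or cron job won't show up here.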

The only other idea I've got is Timeshift's hourly snapshots. (How)
would btrfs deal with a crash during snapshot creation?
It should have no issues whatsoever most of the time. The only case I can think of where it might is if you're snapshotting a subvolume that's being written to at the same time. Snapshots on BTRFS are only truly atomic if none of the data being snapshotted is being written to at the same time. If there are pending writes, there are some indeterminate states involved, and crashing then might produce a corrupted snapshot, but shouldn't cause any other issues.


In other news, I've still not quite given up, mainly because the fs
doesn't look all that broken. The output of btrfs inspect-internal
dump-tree (incl. options), for instance, looks like gibberish to me of
course, but it looks sane, doesn't spew warnings, doesn't error out or
crash. Also plain btrfs check --init-extent-tree errored out, same
with -s0, but with -s1 it's now chugging along. (BTW, is there a
hierarchy among the super block slots, a best or newest one?)
AIUI, when they get updated, they get written out in the order they occur on disk, but other than that they're supposed to always be in sync. So if you hit an issue while the first one is being written out, you can often recover by using the second or a later one.
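For reference, the `-s0`/`-s1`/`-s2` numbers in `btrfs check` correspond to the on-disk superblock copies. A small sketch of where those copies live, based on the well-known btrfs layout (primary at 64 KiB, mirrors at 64 MiB and 256 GiB); double-check against your btrfs-progs documentation before relying on it:

```python
# Sketch of where the btrfs superblock copies live on disk, matching
# the -s0/-s1/-s2 numbering used by `btrfs check`. Offsets follow the
# commonly documented layout; verify against your btrfs-progs version.
def sb_offset(mirror):
    if mirror == 0:
        return 64 * 1024                  # primary superblock at 64 KiB
    return (16 * 1024) << (12 * mirror)   # mirrors at 64 MiB and 256 GiB

for m in range(3):
    print(f"-s{m}: byte offset {sb_offset(m)}")
```

This is also why small devices have fewer copies: a device under 256 GiB simply has no room for the second mirror.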

Will keep you posted.

Cheers,
C.

