Re: File system corruption due to UFS2 extended attributes
On Mon, May 23, 2022 at 09:47:37PM -0700, Chuck Silvers wrote:
> So what can we do about this? There aren't any really great
> options. But the only change which will guarantee that all old
> NetBSD releases (which do not know about extended attributes) will
> not corrupt file system images where extended attributes have been
> stored is to create a new variant of UFS2 with a different magic
> number (the "fs_magic" field in the superblock). This is what I
> propose to do. I spoke with Kirk McKusick about this problem and
> he agreed that creating a new UFS2 variant with a different magic
> number is the best way to deal with this situation.

On the minus side, this means all FreeBSD volumes (which do know about
extended attributes) will be treated as NetBSD 9 volumes (which
don't). There probably isn't any way around this, and it isn't the
first time this has happened, including for UFS1 (e.g. the wapbl bit),
so maybe we just ought to have our own format going forwards, since
this:

 : /*
 :  * NOTE: COORDINATE ON-DISK FORMAT CHANGES WITH THE FREEBSD PROJECT.
 :  */

repeatedly hasn't worked.

But in that case the names of options and whatnot should be set up
accordingly and the default should be our format. We did a migration
like this with partition types years ago and AFAICR it wasn't perfect
but wasn't a trainwreck either.

also, a quibble:

> - fsck will take a new option "-c ea" to specify that an existing UFS2
>   file system should be converted to support extended attributes
>   (ie. converted to UFS2ea).

The migration code really belongs in tunefs rather than fsck. :-|

-- 
David A. Holland
dholl...@netbsd.org
Re: File system corruption due to UFS2 extended attributes
    Date:        Wed, 25 May 2022 07:52:29 -0300
    From:        Crystal Kolipe

  | FreeBSD and DragonflyBSD are basically the same as NetBSD in terms of
  | disklabel layout, so that issue doesn't exist.

If all OpenBSD are doing is using some otherwise unused space, then
we might never even notice (but I have not looked to see).

I don't know what Dragonfly do, but the last time I looked (long ago,
but I doubt it has changed) FreeBSD labels contained block numbers
relative to the start of the MBR partition on architectures using MBR.
NetBSD labels are always relative to the start of the drive (i.e.
absolute block numbers). That's fundamentally different.

But label differences are a minor issue, easy to work around, and
becoming less and less relevant as time passes and they get used
much less.

Filesystem layout and use differences are a whole other problem,
especially when at first glance they appear to be the same and
many things seem to work - but not everything.

It would be nice to try and reconverge on a common format, and needing
to use an updated magic number is the ideal time to make that change.
It means more work, and a bigger format change, but if it could be
accomplished there is the potential for long term benefits.

kre
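To illustrate the label difference described above: a partition offset
stored relative to the enclosing MBR partition turns into an absolute
block number by adding the MBR partition's starting sector. This is only
a minimal sketch; the function names and the sector numbers in the usage
note are made up for illustration and are not taken from any disklabel
code.

```python
def relative_to_absolute(mbr_partition_start: int, relative_offset: int) -> int:
    """Convert a label offset counted from the start of the MBR partition
    into an absolute block number counted from the start of the drive."""
    if mbr_partition_start < 0 or relative_offset < 0:
        raise ValueError("sector numbers must be non-negative")
    return mbr_partition_start + relative_offset


def absolute_to_relative(mbr_partition_start: int, absolute_offset: int) -> int:
    """Inverse conversion: absolute block number back to an offset
    relative to the start of the MBR partition."""
    if absolute_offset < mbr_partition_start:
        raise ValueError("offset lies before the start of the MBR partition")
    return absolute_offset - mbr_partition_start
```

For example, with a hypothetical MBR partition starting at sector 2048,
a relative offset of 16 corresponds to absolute sector 2064. Reading a
relative label as if it were absolute (or the other way round) would
shift every partition by the MBR partition's start.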
Re: File system corruption due to UFS2 extended attributes
On Wed, May 25, 2022 at 02:00:44AM -0700, Chuck Silvers wrote:
> On Tue, May 24, 2022 at 07:51:08AM -0400, Greg Troxel wrote:
> > And same questions for the other active BSD variants, which I think is
> > mostly OpenBSD and Dragonfly these days but I have lost track.
>
> OpenBSD UFS2 appears to be the same as NetBSD <=9 with respect to
> extended attributes (extattrs are not supported). OpenBSD's treatment
> of fs_flags is different as well, only two fs_flags bits are recognized
> and unknown flags are not cleared. At least one superblock field
> is different too.

With regards to exchanging filesystems between OpenBSD and NetBSD, it's
worth noting that OpenBSD has also diverged slightly in the format of the
disklabel, for example by repurposing some old and little used fields to
hold a DUID value.

This _may_ imply that it's safe(r) to assume that anybody sharing a UFS2
filesystem between OpenBSD and another BSD system knows what they are
doing, as they will likely already have encountered compatibility issues
if they have got far enough for the UFS2 extended attributes to be a
concern.

FreeBSD and DragonflyBSD are basically the same as NetBSD in terms of
disklabel layout, so that issue doesn't exist.

For more information, see the sections:

'BSD disklabels - compatibility betweeen BSDs'
'BSD disklabels - enhancements in OpenBSD'

of:

https://www.exoticsilicon.com/jay/reckless_guide_to_openbsd/bsd_disklabels

or

gemini://gemini.exoticsilicon.com/jay/reckless_guide_to_openbsd/bsd_disklabels
Re: File system corruption due to UFS2 extended attributes
On Tue, May 24, 2022 at 07:51:08AM -0400, Greg Troxel wrote:
>
> Chuck Silvers writes:
>
> > The introduction in NetBSD's implementation of UFS2 of the extended
> > attribute code from FreeBSD has introduced a compatibility problem
> > with previous releases of NetBSD. The explanation of this problem is
> > a bit involved and requires knowing some history, so please bear with me
> > as I explain.
>
> Your analysis and approach make sense to me, even though it's
> regrettable that it is necessary. I guess UFS needs zfs-style feature
> flags
>
> What about compatibility with FreeBSD?
>
> - What happens if someone takes a FreeBSD UFS2 filesystem and mounts
>   it under NetBSD 9?

FreeBSD UFS2 and NetBSD 9 UFS2 are "somewhat" compatible, the main
exceptions being extended attributes and the interpretation of some of
the fs_flags bits in the superblock. The fs_flags bits that are used
differently between the two control enablement of various optional
features, such as "check hashes" in FreeBSD, and wapbl and "quota2" in
NetBSD.

Note that FreeBSD's bit for "check hashes" and NetBSD's bit for "quota2"
are the same bit, so if this bit is set by one OS then the other OS will
do the wrong thing. FreeBSD would decide that everything in the NetBSD fs
is corrupt because none of the check hashes matches. NetBSD will refuse
to mount a FreeBSD fs read/write because other quota2 information is
missing or wrong (this one I know from recent experience). Similarly, the
bits for FreeBSD "NFS4 ACLs" and NetBSD "wapbl" are the same.

FreeBSD only clears some unknown fs_flags bits, whereas NetBSD clears all
unknown fs_flags bits.

Looking again now, I see that various of the newer superblock fields are
also different. These fields were added by reusing some of the various
"spare" bytes that were available, but often the same "spare" bytes were
reused for different purposes by each OS.

I'm sure the different interpretations of some of these newer fields can
cause trouble; however, sometimes nothing obviously bad happens when a
file system created on one OS is used on the other OS. It all depends on
exactly what you do.

> - What happens if someone tries to mount a NetBSD <=9 UFS2 filesystem
>   on FreeBSD? A 10 UFS2 filesystem w/o ea? with?

NetBSD <=9 UFS2 vs FreeBSD UFS2 is described above.
NetBSD 10 UFS2 (non-ea) will be the same as NetBSD <=9 UFS2 after the
changes that I am proposing now.
NetBSD 10 UFS2ea will not be recognized at all by FreeBSD (or by
NetBSD <=9).

> Or is it already the case that FreeBSD and NetBSD do not interoperate
> with UFS2?

They will each try to operate on the other's UFS2 file systems
(because they can't tell the difference), but there is a good chance
that data loss will result if you mount read/write from the other OS.

> And same questions for the other active BSD variants, which I think is
> mostly OpenBSD and Dragonfly these days but I have lost track.

OpenBSD UFS2 appears to be the same as NetBSD <=9 with respect to
extended attributes (extattrs are not supported). OpenBSD's treatment
of fs_flags is different as well, only two fs_flags bits are recognized
and unknown flags are not cleared. At least one superblock field
is different too.

Dragonfly does not support UFS2 at all.

-Chuck
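The fs_flags collision described in this message can be sketched roughly
as follows. This is an illustration only: the two numeric bit values are
assumptions chosen for the example (check each OS's fs.h for the real
definitions), and the helper names are hypothetical. What it shows is the
shape of the problem: one on-disk bit, two incompatible meanings, plus
the "clear everything you don't know" policy attributed to NetBSD above.

```python
# Assumed bit positions, for illustration only.
SHARED_BIT_A = 0x0100  # FreeBSD: "NFS4 ACLs"    / NetBSD: "wapbl"
SHARED_BIT_B = 0x0200  # FreeBSD: "check hashes" / NetBSD: "quota2"

FREEBSD_MEANING = {SHARED_BIT_A: "NFS4 ACLs", SHARED_BIT_B: "check hashes"}
NETBSD_MEANING = {SHARED_BIT_A: "wapbl", SHARED_BIT_B: "quota2"}


def interpret(fs_flags: int, meanings: dict) -> list:
    """Return the feature names one OS would believe are enabled."""
    return sorted(name for bit, name in meanings.items() if fs_flags & bit)


def clear_unknown(fs_flags: int, known_mask: int) -> int:
    """Model of the 'clear all unknown fs_flags bits' policy: keep only
    the bits this OS recognizes and drop everything else."""
    return fs_flags & known_mask
```

With these assumed values, a superblock carrying only SHARED_BIT_B reads
as "check hashes" through the FreeBSD table and as "quota2" through the
NetBSD table, which is the mismatch that leads to the refused read/write
mount mentioned above.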
Re: File system corruption due to UFS2 extended attributes
On Tue, May 24, 2022 at 06:25:34AM -, Michael van Elst wrote:
> c...@chuq.com (Chuck Silvers) writes:
>
> > - fsck will take a new option "-c ea" to specify that an existing UFS2
> >   file system should be converted to support extended attributes
> >   (ie. converted to UFS2ea). This conversion first clears all of the
> >   on-disk pointers to extended attribute blocks (the inode "di_extb"
> >   field), since in NetBSD releases prior to NetBSD 10, those pointers
> >   could only have been set to non-zero values by corruption in the
> >   file system.
>
> There should be a way back so that the filesystem becomes usable
> by netbsd-9 again (basically: clear di_extb and set magic to UFS2).
> Would also be nice to pull up that feature to netbsd-9.

(please don't remove current-users from the cc, this discussion is as
much for that audience as it is for tech-kern)

having an option to fsck to convert back to non-ea UFS2 is reasonable,
with the warning that this results in throwing away all extattrs in the
fs. I'll add that. note that this will also free any blocks which were
being used to store extattr data.

back-porting that option to netbsd-9 can be done as well, though of
course it wouldn't help if the fs in question is the root fs.

-Chuck
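The "way back" discussed here (clear di_extb, free the extattr blocks,
set the magic back to UFS2) could look roughly like the toy model below.
It operates on in-memory Python objects, not on a real file system, and
is not the fsck implementation. UFS2EA_MAGIC is a hypothetical
placeholder value, and the Inode and FileSystem classes are invented for
the sketch; only the UFS2 magic 0x19540119 is the real value.

```python
from dataclasses import dataclass, field

UFS2_MAGIC = 0x19540119    # UFS2 superblock magic
UFS2EA_MAGIC = 0x0EA2EA2E  # hypothetical placeholder, NOT the real UFS2ea magic


@dataclass
class Inode:
    # Pointers to extended attribute blocks; 0 means "no block".
    di_extb: list


@dataclass
class FileSystem:
    fs_magic: int
    inodes: list
    free_blocks: set = field(default_factory=set)


def convert_to_plain_ufs2(fs: FileSystem) -> int:
    """Throw away all extattrs and mark the fs as plain UFS2.
    Returns the number of extattr blocks that were freed."""
    if fs.fs_magic != UFS2EA_MAGIC:
        raise ValueError("not a UFS2ea file system")
    freed = 0
    for inode in fs.inodes:
        for i, blk in enumerate(inode.di_extb):
            if blk != 0:
                fs.free_blocks.add(blk)  # block is available again
                inode.di_extb[i] = 0     # clear the on-disk pointer
                freed += 1
    fs.fs_magic = UFS2_MAGIC             # old releases now recognize the fs
    return freed
```

The ordering in the sketch (pointers first, magic last) is a design
choice of the example: if the conversion were interrupted, the fs would
still carry the new magic and would not be picked up by a release that
does not understand extattrs.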
Re: File system corruption due to UFS2 extended attributes
Chuck Silvers writes:

> The introduction in NetBSD's implementation of UFS2 of the extended
> attribute code from FreeBSD has introduced a compatibility problem
> with previous releases of NetBSD. The explanation of this problem is
> a bit involved and requires knowing some history, so please bear with me
> as I explain.

Your analysis and approach make sense to me, even though it's
regrettable that it is necessary. I guess UFS needs zfs-style feature
flags.

What about compatibility with FreeBSD?

- What happens if someone takes a FreeBSD UFS2 filesystem and mounts
  it under NetBSD 9?

- What happens if someone tries to mount a NetBSD <=9 UFS2 filesystem
  on FreeBSD? A 10 UFS2 filesystem w/o ea? with?

Or is it already the case that FreeBSD and NetBSD do not interoperate
with UFS2?

And same questions for the other active BSD variants, which I think is
mostly OpenBSD and Dragonfly these days but I have lost track.
Re: file system corruption
On Fri, Oct 16, 2020 at 12:26:03AM +0900, Rin Okuyama wrote:
> On 2020/10/15 20:27, Thomas Klausner wrote:
> > On Thu, Oct 15, 2020 at 12:03:36PM +0100, Patrick Welche wrote:
> > > Is yours a ryzen system? (mine is, and it has filesystem issues - just
> > > trying to see why it is not a common issue)
> >
> > Yes:
>
> There was a report on Twitter (in Japanese):
>
> https://twitter.com/rin5roid/status/1312728335299104768
>
> GCC for aarch64 built by Ryzen causes SIGILL, while that built by
> Intel processor works without problems. I've never observed such a
> failure (I'm using only Intel processors at the moment).

I don't think it's a general problem - until my update from early
October (and now after downgrading the kernel) the machine is stable.

 Thomas
Re: file system corruption
On 2020/10/15 20:27, Thomas Klausner wrote:
> On Thu, Oct 15, 2020 at 12:03:36PM +0100, Patrick Welche wrote:
> > Is yours a ryzen system? (mine is, and it has filesystem issues - just
> > trying to see why it is not a common issue)
>
> Yes:

There was a report on Twitter (in Japanese):

https://twitter.com/rin5roid/status/1312728335299104768

GCC for aarch64 built by Ryzen causes SIGILL, while that built by
Intel processor works without problems. I've never observed such a
failure (I'm using only Intel processors at the moment).

Thanks,
rin
Re: file system corruption
On Thu, Oct 15, 2020 at 12:03:36PM +0100, Patrick Welche wrote:
> Is yours a ryzen system? (mine is, and it has filesystem issues - just
> trying to see why it is not a common issue)

Yes:

# cpuctl identify 0
cpu0: highest basic info 000d
cpu0: highest extended info 801f
cpu0: "AMD Ryzen Threadripper 2950X 16-Core Processor "
cpu0: AMD Family 17h (686-class), 3493.44 MHz
cpu0: family 0x17 model 0x8 stepping 0x2 (id 0x800f82)
cpu0: features 0x178bfbff
cpu0: features 0x178bfbff
cpu0: features1 0x7ed8320b
cpu0: features1 0x7ed8320b
cpu0: features2 0x2fd3fbff
cpu0: features2 0x2fd3fbff
cpu0: features3 0x35c233ff
cpu0: features3 0x35c233ff
cpu0: features3 0x35c233ff
cpu0: features5 0x209c01a9
cpu0: features5 0x209c01a9
...

 Thomas
Re: file system corruption
On Sun, Oct 11, 2020 at 11:19:16PM +0200, Thomas Klausner wrote:
> I've had serious file system corruption. Mostly in mercurial and
> sqlite3 databases, but also in normal files.

> Anyone else having problems?

Is yours a ryzen system? (mine is, and it has filesystem issues - just
trying to see why it is not a common issue)

Cheers,

Patrick
Re: file system corruption
On Mon, Oct 12, 2020 at 06:39:48AM +0200, Martin Husemann wrote:
> On Sun, Oct 11, 2020 at 11:19:16PM +0200, Thomas Klausner wrote:
> > I don't know enough about the internals of the hg and sqlite3, but I
> > also saw a broken zip archive and had a good copy for comparison. In
> > that case, a block of 256 bytes was zero instead of the real data.
>
> Do you know the file offset where the corruption started?
> Can you show "dumpfs $rawdev | head -15" for that file system?

Reminds me of PR kern/55362. If I started with a disk full of zeros,
some ranges would have zero instead of the real data. If I started
with a disk full of ones, some ranges would contain ones instead of the
real data.

In other news, just now, after a clean reboot to use a new kernel, the
system came up with

[ 1885.434544] panic: ffs_blkfree: bad size: dev = 0xa803, bno = 331526 bsize = 32768, size = 12288, fs = /usr/obj

(different filesystem & disk)

Cheers,

Patrick
Re: file system corruption
On Mon, Oct 12, 2020 at 06:39:48AM +0200, Martin Husemann wrote:
> Do you know the file offset where the corruption started?

I don't have that one any more, but I found a different one. In this
case, the range of bytes from 1291124737-1291157504 (32768 bytes) is
zeroed out. 1291124737 = 0x4CF50001, 1291157504 = 0x4CF58000.
It's on NFS.

> Can you show "dumpfs $rawdev | head -15" for that file system?

One was NFS where I can't get this to work. The other is

dumpfs /disk/storage_202008 | head -15
file system: /dev/rdk5
format  FFSv2
endian  little-endian
location 65536  (-b 128)
magic   19540119        time    Tue Oct 13 22:11:31 2020
superblock location     65536   id      [ 5f2bfb64 6b0a718f ]
cylgrp  dynamic inodes  FFSv2   sblock  FFSv2   fslevel 5
nbfree  123342287       ndir    4136    nifree  120190114       nffree  14975
ncg     17107   size    3856137728      blocks  3848336845
bsize   32768   shift   15      mask    0x8000
fsize   4096    shift   12      mask    0xf000
frag    8       shift   3       fsbtodb 3
bpg     28177   fpg     225416  ipg     7040
minfree 1%      optim   space   maxcontig 2     maxbpg  4096
symlinklen 120  contigsumsize 2

(256 bytes seemed small to me for a file system issue - one more reason
to think it might be UVM related).

 Thomas
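The numbers quoted in this message can be checked with a few lines of
arithmetic. The sketch below is only a cross-check of the reported range;
describe_range is a hypothetical helper, and the 32768 block size is
borrowed from the "bsize" of the local FFSv2 file system, which may or
may not be relevant for the file on NFS.

```python
BSIZE = 32768  # "bsize" from the dumpfs output of the local FFSv2


def describe_range(first: int, last: int, block_size: int) -> dict:
    """Summarize an inclusive byte range and test whether it lines up
    with block boundaries under 1-based or 0-based byte numbering."""
    return {
        "length": last - first + 1,
        "first_hex": hex(first),
        "last_hex": hex(last),
        # 1-based: byte 1 is the first byte of the file
        "aligned_if_1_based": (first - 1) % block_size == 0
        and last % block_size == 0,
        # 0-based: byte 0 is the first byte of the file
        "aligned_if_0_based": first % block_size == 0
        and (last + 1) % block_size == 0,
    }


info = describe_range(1291124737, 1291157504, BSIZE)
```

Here info["length"] is 32768 and the hex values match the ones given
above. If the positions are 1-based (as some comparison tools report
them, which is an assumption here), the zeroed range is exactly one
32 KiB-aligned block.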
Re: file system corruption
On Sun, Oct 11, 2020 at 11:19:16PM +0200, Thomas Klausner wrote:
> Hi!
>
> I've recently updated from 9.99.73 from Sep 17 to one of Oct 5.
>
> I've had serious file system corruption. Mostly in mercurial and
> sqlite3 databases, but also in normal files.

what platform is this on?

> Some of the file systems where this happened are NFS-served from Linux,
> but I also saw it on a local FFSv2.

I would ask what the fs block size is, but with NFS there isn't any
fs block size involved, so I doubt it matters for the local file systems
either.

> 2c2
> < $NetBSD: uvm_amap.c,v 1.123 2020/08/18 10:40:20 chs Exp $
> ---
> > $NetBSD: uvm_amap.c,v 1.125 2020/09/21 18:41:59 chs Exp $
>
> 11c11
> < $NetBSD: uvm_io.c,v 1.28 2016/05/25 17:43:58 christos Exp $
> ---
> > $NetBSD: uvm_io.c,v 1.29 2020/09/21 18:41:59 chs Exp $

the above changes go together, and they were long enough ago that
if they were the cause of the corruption then it seems likely that
someone else would have reported it before you. also, these changes
are about process address space manipulation and not file systems,
so if this were the problem then you would be getting non-file-system
symptoms too.

> 5c5
> < $NetBSD: uvm_bio.c,v 1.121 2020/07/09 09:24:32 rin Exp $
> ---
> > $NetBSD: uvm_bio.c,v 1.122 2020/10/05 04:48:23 rin Exp $

this change is somewhat more recent and specifically about file systems,
so this seems more likely.

could you try testing with each of the above sets of changes separately
backed out, to see if you can narrow it down to one change?

if the problem is not due to either of those sets of changes then
your best bet is to bisect to find the change that introduced the
problem.

I tried to check the automated test results to see if those are showing
any problems that look related, but that web server is down right now.

> Anyone else having problems?
>
> Any ideas?
>  Thomas

-Chuck
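The bisection suggested at the end of this message boils down to a
binary search over an ordered list of source states. A minimal sketch,
assuming a single change introduced the problem and that a reliable
is_bad test exists (in practice: build that kernel, boot it, and try to
reproduce the corruption). first_bad and is_bad are hypothetical names,
not part of any NetBSD tool.

```python
def first_bad(revisions: list, is_bad) -> object:
    """Return the first revision for which is_bad() is true.

    revisions must be ordered oldest to newest, the oldest must be
    known good and the newest known bad."""
    lo, hi = 0, len(revisions) - 1
    if is_bad(revisions[lo]) or not is_bad(revisions[hi]):
        raise ValueError("need a good oldest and a bad newest revision")
    # Invariant: revisions[lo] is good, revisions[hi] is bad.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(revisions[mid]):
            hi = mid
        else:
            lo = mid
    return revisions[hi]
```

Each step halves the remaining range, so even a few hundred commits
between the known good and the known bad kernel need fewer than ten
builds. An intermittent failure makes is_bad unreliable, and a "good"
verdict then deserves several test runs before it is trusted.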
Re: file system corruption
On Sun, Oct 11, 2020 at 11:19:16PM +0200, Thomas Klausner wrote:
> I don't know enough about the internals of the hg and sqlite3, but I
> also saw a broken zip archive and had a good copy for comparison. In
> that case, a block of 256 bytes was zero instead of the real data.

Do you know the file offset where the corruption started?
Can you show "dumpfs $rawdev | head -15" for that file system?

Martin