Re: File system corruption due to UFS2 extended attributes

2022-05-25 Thread David Holland
On Mon, May 23, 2022 at 09:47:37PM -0700, Chuck Silvers wrote:
 > So what can we do about this?  There aren't any really great
 > options.  But the only change which will guarantee that all old
 > NetBSD releases (which do not know about extended attributes) will
 > not corrupt file system images where extended attributes have been
 > stored is to create a new variant of UFS2 with a different magic
 > number (the "fs_magic" field in the superblock).  This is what I
 > propose to do.  I spoke with Kirk McKusick about this problem and
 > he agreed that creating a new UFS2 variant with a different magic
 > number is the best way to deal with this situation.

On the minus side, this means all FreeBSD volumes (which do know about
extended attributes) will be treated as NetBSD 9 volumes (which
don't).

There probably isn't any way around this, and it isn't the first time
this has happened, including for UFS1 (e.g. the wapbl bit), so maybe
we just ought to have our own format going forwards, since this:

 : /*
 :  * NOTE: COORDINATE ON-DISK FORMAT CHANGES WITH THE FREEBSD PROJECT.
 :  */

repeatedly hasn't worked.

But in that case the names of options and whatnot should be set up
accordingly and the default should be our format.

We did a migration like this with partition types years ago and AFAICR
it wasn't perfect but wasn't a trainwreck either.


also, a quibble:

 >  - fsck will take a new option "-c ea" to specify that an existing UFS2
 >    file system should be converted to support extended attributes
 >    (ie. converted to UFS2ea).

The migration code really belongs in tunefs rather than fsck. :-|

-- 
David A. Holland
dholl...@netbsd.org


Re: File system corruption due to UFS2 extended attributes

2022-05-25 Thread Robert Elz
Date:        Wed, 25 May 2022 07:52:29 -0300
From:        Crystal Kolipe
Message-ID:

  | FreeBSD and DragonflyBSD are basically the same as NetBSD in terms of
  | disklabel layout, so that issue doesn't exist.

If all OpenBSD are doing is using some otherwise unused space,
then we might never even notice (but I have not looked to see).

I don't know what Dragonfly do, but the last time I looked
(long ago, but I doubt it has changed) FreeBSD labels
contained block numbers relative to the start of the MBR
partition on architectures using MBR.  NetBSD labels are
always relative to the start of the drive (i.e. absolute block
numbers).   That's fundamentally different.
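
A small sketch of the difference described above, assuming a BSD label that lives inside an MBR partition starting at sector 2048 (a made-up value). With slice-relative numbers, a partition at offset 0 really starts at absolute sector 2048; a reader that takes the stored 0 as absolute would instead look at sector 0, the MBR itself. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * slice_start: first sector of the MBR partition holding the label
 * (illustrative input, not read from any real disk).
 */

/* Slice-relative label entry (FreeBSD style, as recalled above) -> absolute sector. */
uint64_t
rel_to_abs(uint64_t slice_start, uint64_t rel_offset)
{
	return slice_start + rel_offset;
}

/* Absolute sector (NetBSD style) -> slice-relative offset. */
uint64_t
abs_to_rel(uint64_t slice_start, uint64_t abs_offset)
{
	assert(abs_offset >= slice_start);	/* must lie inside the slice */
	return abs_offset - slice_start;
}
```

Converting in either direction is trivial once the slice start is known, which fits the point that label differences are easy to work around.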

But label differences are a minor issue, easy to work around,
and becoming less and less relevant as time passes and they get
used much less.

Filesystem layout and use differences are a whole other problem,
especially when at first glance they appear to be the same
and many things seem to work - but not everything.  It would
be nice to try and reconverge on a common format, and needing
to use an updated magic number is the ideal time to make
that change.  It means more work and a bigger format change,
but if it could be accomplished there is the potential for
long-term benefits.

kre


Re: File system corruption due to UFS2 extended attributes

2022-05-25 Thread Crystal Kolipe
On Wed, May 25, 2022 at 02:00:44AM -0700, Chuck Silvers wrote:
> On Tue, May 24, 2022 at 07:51:08AM -0400, Greg Troxel wrote:
> > And same questions for the other active BSD variants, which I think is
> > mostly OpenBSD and Dragonfly these days but I have lost track.
> 
> OpenBSD UFS2 appears to be the same as NetBSD <=9 with respect to
> extended attributes (extattrs are not supported).  OpenBSD's treatment
> of fs_flags is different as well, only two fs_flags bits are recognized
> and unknown flags are not cleared.  At least one superblock field
> is different too.

With regards to exchanging filesystems between OpenBSD and NetBSD, it's
worth noting that OpenBSD has also diverged slightly in the format of the
disklabel, for example by repurposing some old and little-used fields to
hold a DUID value.

This _may_ imply that it's safe(r) to assume that anybody sharing a UFS2
filesystem between OpenBSD and another BSD system knows what they are
doing, as they will likely already have encountered compatibility issues
if they have got far enough for the UFS2 extended attributes to be a
concern.

FreeBSD and DragonflyBSD are basically the same as NetBSD in terms of
disklabel layout, so that issue doesn't exist.

For more information, see the sections:

'BSD disklabels - compatibility betweeen BSDs'
'BSD disklabels - enhancements in OpenBSD'

of:

https://www.exoticsilicon.com/jay/reckless_guide_to_openbsd/bsd_disklabels

or

gemini://gemini.exoticsilicon.com/jay/reckless_guide_to_openbsd/bsd_disklabels


Re: File system corruption due to UFS2 extended attributes

2022-05-25 Thread Chuck Silvers
On Tue, May 24, 2022 at 07:51:08AM -0400, Greg Troxel wrote:
> 
> Chuck Silvers  writes:
> 
> > The introduction in NetBSD's implementation of UFS2 of the extended
> > attribute code from FreeBSD has introduced a compatibility problem
> > with previous releases of NetBSD.  The explanation of this problem is
> > a bit involved and requires knowing some history, so please bear with me
> > as I explain.
> 
> Your analysis and approach make sense to me, even though it's
> regrettable that it is necessary.  I guess UFS needs zfs-style feature
> flags
> 
> What about compatibility with FreeBSD?
> 
>   - What happens if someone takes a FreeBSD UFS2 filesystem and mounts
> it under NetBSD 9?

FreeBSD UFS2 and NetBSD 9 UFS2 are "somewhat" compatible, the main exceptions
being extended attributes and the interpretation of some of the fs_flags bits
in the superblock.  The fs_flags bits that are used differently by
the two control various optional features, such as
"check hashes" in FreeBSD, and wapbl and "quota2" in NetBSD.
Note that FreeBSD's bit for "check hashes" and NetBSD's bit for "quota2"
are the same bit, so if this bit is set by one OS then the other OS
will do the wrong thing.  FreeBSD would decide that everything in
the NetBSD fs is corrupt because none of the check hashes match.
NetBSD will refuse to mount a FreeBSD fs read/write because other quota2
information is missing or wrong (this one I know from recent experience).

Similarly, the bits for FreeBSD "NFS4 ACLs" and NetBSD "wapbl" are the same.
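
To make the collision concrete: both kernels test the same 32-bit fs_flags word, each with its own name for a bit. The numeric values below are from memory of the two sys/ufs/ffs/fs.h headers and should be treated as assumptions; the thread only establishes that each pair shares a bit, which is all this sketch depends on.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Assumed values (check NetBSD's and FreeBSD's sys/ufs/ffs/fs.h).
 * The thread only states that each NetBSD/FreeBSD pair is one bit.
 */
#define NETBSD_FS_DOWAPBL      0x00000100u
#define NETBSD_FS_DOQUOTA2     0x00000200u
#define FREEBSD_FS_NFS4ACLS    0x00000100u
#define FREEBSD_FS_METACKHASH  0x00000200u

/* How each kernel interprets the same on-disk fs_flags word. */
bool
netbsd_sees_wapbl(uint32_t fs_flags)
{
	return (fs_flags & NETBSD_FS_DOWAPBL) != 0;
}

bool
netbsd_sees_quota2(uint32_t fs_flags)
{
	return (fs_flags & NETBSD_FS_DOQUOTA2) != 0;
}

bool
freebsd_sees_nfs4acls(uint32_t fs_flags)
{
	return (fs_flags & FREEBSD_FS_NFS4ACLS) != 0;
}

bool
freebsd_sees_ckhash(uint32_t fs_flags)
{
	return (fs_flags & FREEBSD_FS_METACKHASH) != 0;
}
```

So a NetBSD volume with quota2 enabled looks, to FreeBSD, like one whose metadata should carry check hashes, and a FreeBSD volume with NFSv4 ACLs enabled looks, to NetBSD, like one with wapbl enabled.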

FreeBSD only clears some unknown fs_flags bits, whereas NetBSD clears
all unknown fs_flags bits.
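
A sketch of the mount-time policies as the thread describes them: NetBSD clears every unknown bit, FreeBSD only some, and OpenBSD (per the note on it below) none. Both masks are hypothetical placeholders, not the real sets of bits either kernel knows.

```c
#include <stdint.h>

/* Hypothetical masks, for illustration only. */
#define NETBSD_KNOWN_FLAGS  0x000003ffu  /* bits this NetBSD "understands" */
#define FREEBSD_SUPPORTED   0x00ffffffu  /* bits outside this range are cleared */

/* NetBSD: clear every bit it does not recognize. */
uint32_t
netbsd_mount_flags(uint32_t f)
{
	return f & NETBSD_KNOWN_FLAGS;
}

/* FreeBSD: clear only unknown bits outside its supported range. */
uint32_t
freebsd_mount_flags(uint32_t f)
{
	return f & FREEBSD_SUPPORTED;
}

/* OpenBSD: leave unknown bits alone. */
uint32_t
openbsd_mount_flags(uint32_t f)
{
	return f;
}
```

The practical effect is that a bit set by one OS may be silently dropped by NetBSD, or survive on FreeBSD or OpenBSD with a meaning its writer never intended.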

Looking again now, I see that several of the newer superblock fields are
also different.  These fields were added by reusing some of the various
"spare" bytes that were available, but often the same "spare" bytes were
reused for different purposes by each OS.  I'm sure the different
interpretations of some of these newer fields can cause trouble;
however, sometimes nothing obviously bad happens when a file system
created on one OS is used on the other OS.

It all depends on exactly what you do.
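
A hypothetical illustration of why reused spare bytes can go either way: eight bytes that one OS treats as a single 64-bit field and another as two 32-bit fields. All names are invented. While both sides leave the bytes zero, the other OS reads a harmless 0; once one side stores a value, the other reads something it never wrote.

```c
#include <stdint.h>
#include <string.h>

/* The same 8 formerly spare superblock bytes, as each OS sees them. */
struct spare_as_os_a {
	int64_t a_counter;	/* OS A: one 64-bit field */
};
struct spare_as_os_b {
	int32_t b_flags2;	/* OS B: two 32-bit fields */
	int32_t b_hint;
};

/* OS A interpreting bytes that OS B may have written. */
int64_t
os_a_reads(const unsigned char spare[8])
{
	struct spare_as_os_a a;

	memcpy(&a, spare, sizeof a);
	return a.a_counter;
}
```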


>   - What happens if someone tries to mount a NetBSD <=9 UFS2 filesystem
> on FreeBSD?   A 10 UFS2 filesystem w/o ea?  with?

NetBSD <=9 UFS2 vs FreeBSD UFS2 is described above.
NetBSD 10 UFS2 (non-ea) will be the same as NetBSD <=9 UFS2
after the changes that I am proposing now.
NetBSD 10 UFS2ea will not be recognized at all by FreeBSD (or by NetBSD <=9).
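
Why a new magic number protects older kernels can be sketched as a probe function: a kernel only treats a superblock as FFS if fs_magic matches a value it knows, so a UFS2ea file system is simply not recognized by older kernels and never gets mounted there. FS_UFS1_MAGIC and FS_UFS2_MAGIC are the long-standing values (the dumpfs output further down shows magic 19540119); the UFS2ea value is an assumption, and the real probe also handles byte-swapped magics and several superblock locations, which this sketch omits.

```c
#include <stdint.h>

#define FS_UFS1_MAGIC         0x00011954u
#define FS_UFS2_MAGIC         0x19540119u
#define UFS2EA_MAGIC_ASSUMED  0x19012038u	/* assumed; verify in fs.h */

enum ufs_kind { UFS_UNKNOWN, UFS_1, UFS_2, UFS_2EA };

/* A kernel from before the change (e.g. NetBSD <= 9) knows two magics. */
enum ufs_kind
old_kernel_probe(uint32_t magic)
{
	switch (magic) {
	case FS_UFS1_MAGIC:
		return UFS_1;
	case FS_UFS2_MAGIC:
		return UFS_2;
	default:
		return UFS_UNKNOWN;	/* not recognized: refuse to mount */
	}
}

/* A kernel with the proposed change also knows the UFS2ea magic. */
enum ufs_kind
new_kernel_probe(uint32_t magic)
{
	if (magic == UFS2EA_MAGIC_ASSUMED)
		return UFS_2EA;
	return old_kernel_probe(magic);
}
```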


> Or is it already the case that FreeBSD and NetBSD do not interoperate
> with UFS2?

They will each try to operate on the other's UFS2 file systems
(because they can't tell the difference), but there is a good chance
that data loss will result if you mount read/write from the other OS.


> And same questions for the other active BSD variants, which I think is
> mostly OpenBSD and Dragonfly these days but I have lost track.

OpenBSD UFS2 appears to be the same as NetBSD <=9 with respect to
extended attributes (extattrs are not supported).  OpenBSD's treatment
of fs_flags is different as well, only two fs_flags bits are recognized
and unknown flags are not cleared.  At least one superblock field
is different too.

Dragonfly does not support UFS2 at all.

-Chuck


Re: File system corruption due to UFS2 extended attributes

2022-05-25 Thread Chuck Silvers
On Tue, May 24, 2022 at 06:25:34AM -, Michael van Elst wrote:
> c...@chuq.com (Chuck Silvers) writes:
> 
> > - fsck will take a new option "-c ea" to specify that an existing UFS2
> >   file system should be converted to support extended attributes
> >   (ie. converted to UFS2ea).  This conversion first clears all of the on-disk
> >   since in NetBSD releases prior to NetBSD 10, those pointers could only
> >   have been set to non-zero values by corruption in the file system.
> 
> There should be a way back so that the filesystem becomes usable
> by netbsd-9 again (basically: clear di_extb and set magic to UFS2).
> Would also be nice to pull up that feature to netbsd-9.

(please don't remove current-users from the cc, this discussion is
as much for that audience as it is for tech-kern)

having an option to fsck to convert back to non-ea UFS2 is reasonable,
with the warning that this results in throwing away all extattrs in the fs.
I'll add that.  note that this will also free any blocks which were
being used to store extattr data.
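
A minimal sketch of the two conversions, on heavily simplified in-memory structures rather than the real on-disk ones (struct toy_dinode and struct toy_fs are hypothetical; UFS_NXADDR = 2 and the UFS2ea magic are assumptions). Converting to UFS2ea only zeroes di_extb, since before NetBSD 10 those pointers could only be non-zero through corruption; converting back throws away every extattr and reports which blocks must be released. Real code would also need each block's size and would update the cylinder-group free maps, which is left out.

```c
#include <stddef.h>
#include <stdint.h>

#define UFS_NXADDR    2			/* extattr block pointers per dinode (assumed) */
#define MAGIC_UFS2    0x19540119u
#define MAGIC_UFS2EA  0x19012038u	/* assumed value */

struct toy_dinode {
	uint32_t di_extsize;		/* bytes of extattr data */
	int64_t  di_extb[UFS_NXADDR];	/* extattr block pointers */
};
struct toy_fs {
	uint32_t fs_magic;
	size_t ninodes;
	struct toy_dinode *inodes;
};

/* UFS2 -> UFS2ea: zero stray pointers without freeing anything. */
void
convert_to_ufs2ea(struct toy_fs *fs)
{
	for (size_t i = 0; i < fs->ninodes; i++)
		for (size_t j = 0; j < UFS_NXADDR; j++)
			fs->inodes[i].di_extb[j] = 0;
	fs->fs_magic = MAGIC_UFS2EA;
}

/*
 * UFS2ea -> UFS2: discard all extattrs.  Released block numbers are
 * stored in freed[] (up to cap entries); returns the total count.
 */
size_t
convert_to_ufs2(struct toy_fs *fs, int64_t *freed, size_t cap)
{
	size_t nfreed = 0;

	for (size_t i = 0; i < fs->ninodes; i++) {
		struct toy_dinode *dp = &fs->inodes[i];

		for (size_t j = 0; j < UFS_NXADDR; j++) {
			if (dp->di_extb[j] != 0) {
				if (nfreed < cap)
					freed[nfreed] = dp->di_extb[j];
				nfreed++;
				dp->di_extb[j] = 0;
			}
		}
		dp->di_extsize = 0;
	}
	fs->fs_magic = MAGIC_UFS2;
	return nfreed;
}
```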

back-porting that option to netbsd-9 can be done as well,
though of course it wouldn't help if the fs in question is the root fs.

-Chuck


Re: File system corruption due to UFS2 extended attributes

2022-05-24 Thread Greg Troxel

Chuck Silvers  writes:

> The introduction in NetBSD's implementation of UFS2 of the extended
> attribute code from FreeBSD has introduced a compatibility problem
> with previous releases of NetBSD.  The explanation of this problem is
> a bit involved and requires knowing some history, so please bear with me
> as I explain.

Your analysis and approach make sense to me, even though it's
regrettable that it is necessary.  I guess UFS needs zfs-style feature
flags

What about compatibility with FreeBSD?

  - What happens if someone takes a FreeBSD UFS2 filesystem and mounts
it under NetBSD 9?

  - What happens if someone tries to mount a NetBSD <=9 UFS2 filesystem
on FreeBSD?   A 10 UFS2 filesystem w/o ea?  with?

Or is it already the case that FreeBSD and NetBSD do not interoperate
with UFS2?

And same questions for the other active BSD variants, which I think is
mostly OpenBSD and Dragonfly these days but I have lost track.




Re: file system corruption

2020-10-15 Thread Thomas Klausner
On Fri, Oct 16, 2020 at 12:26:03AM +0900, Rin Okuyama wrote:
> On 2020/10/15 20:27, Thomas Klausner wrote:
> > On Thu, Oct 15, 2020 at 12:03:36PM +0100, Patrick Welche wrote:
> > > Is yours a ryzen system? (mine is, and it has filesystem issues - just
> > > trying to see why it is not a common issue)
> > 
> > Yes:
> 
> There was a report on Twitter (in Japanese):
> 
> https://twitter.com/rin5roid/status/1312728335299104768
> 
> GCC for aarch64 built on a Ryzen machine causes SIGILL, while the same
> build done on an Intel processor works without problems. I've never
> observed such a failure (I'm using only Intel processors at the moment).

I don't think it's a general problem - until my update from early
October (and now after downgrading the kernel) the machine is stable.
 Thomas


Re: file system corruption

2020-10-15 Thread Rin Okuyama

On 2020/10/15 20:27, Thomas Klausner wrote:
> On Thu, Oct 15, 2020 at 12:03:36PM +0100, Patrick Welche wrote:
> > Is yours a ryzen system? (mine is, and it has filesystem issues - just
> > trying to see why it is not a common issue)
> 
> Yes:

There was a report on Twitter (in Japanese):

https://twitter.com/rin5roid/status/1312728335299104768

GCC for aarch64 built on a Ryzen machine causes SIGILL, while the same
build done on an Intel processor works without problems. I've never
observed such a failure (I'm using only Intel processors at the moment).

Thanks,
rin


Re: file system corruption

2020-10-15 Thread Thomas Klausner
On Thu, Oct 15, 2020 at 12:03:36PM +0100, Patrick Welche wrote:
> Is yours a ryzen system? (mine is, and it has filesystem issues - just
> trying to see why it is not a common issue)

Yes:

# cpuctl identify 0
cpu0: highest basic info 000d
cpu0: highest extended info 801f
cpu0: "AMD Ryzen Threadripper 2950X 16-Core Processor "
cpu0: AMD Family 17h (686-class), 3493.44 MHz
cpu0: family 0x17 model 0x8 stepping 0x2 (id 0x800f82)
cpu0: features 0x178bfbff
cpu0: features 0x178bfbff
cpu0: features1 0x7ed8320b
cpu0: features1 0x7ed8320b
cpu0: features2 0x2fd3fbff
cpu0: features2 0x2fd3fbff
cpu0: features3 0x35c233ff
cpu0: features3 0x35c233ff
cpu0: features3 0x35c233ff
cpu0: features5 0x209c01a9
cpu0: features5 0x209c01a9
...

 Thomas


Re: file system corruption

2020-10-15 Thread Patrick Welche
On Sun, Oct 11, 2020 at 11:19:16PM +0200, Thomas Klausner wrote:
> I've had serious file system corruption. Mostly in mercurial and
> sqlite3 databases, but also in normal files.

> Anyone else having problems?

Is yours a ryzen system? (mine is, and it has filesystem issues - just
trying to see why it is not a common issue)


Cheers,

Patrick


Re: file system corruption

2020-10-14 Thread Patrick Welche
On Mon, Oct 12, 2020 at 06:39:48AM +0200, Martin Husemann wrote:
> On Sun, Oct 11, 2020 at 11:19:16PM +0200, Thomas Klausner wrote:
> > I don't know enough about the internals of the hg and sqlite3, but I
> > also saw a broken zip archive and had a good copy for comparison. In
> > that case, a block of 256 bytes was zero instead of the real data.
> 
> Do you know the file offset where the corruption started?
> Can you show "dumpfs $rawdev | head -15" for that file system?

Reminds me of PR kern/55362. If I started with a disk full of zeros,
some ranges would have zero instead of the real data. If I started
with a disk full of ones, some ranges would contain ones instead of
the real data.
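
The fill-pattern approach from PR kern/55362 can be checked mechanically: after a test run on a disk pre-filled with 0x00 (or 0xff), look for long runs of the fill byte inside files that should contain real data. A sketch over an in-memory buffer; the function name and the minimum run length are illustrative choices, and files that legitimately contain long zero runs will give false positives, so a known-good copy remains the better reference when one exists.

```c
#include <assert.h>
#include <stddef.h>

struct fill_run {
	size_t start;	/* 0-based offset of the first fill byte */
	size_t len;	/* number of consecutive fill bytes */
};

/*
 * Find runs of at least min_len consecutive bytes equal to fill.
 * Up to max_runs runs are stored in out[]; the return value is the
 * total number found, which may exceed max_runs.
 */
size_t
find_fill_runs(const unsigned char *buf, size_t n, unsigned char fill,
    size_t min_len, struct fill_run *out, size_t max_runs)
{
	size_t found = 0, i = 0;

	assert(min_len > 0);
	while (i < n) {
		if (buf[i] != fill) {
			i++;
			continue;
		}
		size_t start = i;
		while (i < n && buf[i] == fill)
			i++;
		if (i - start >= min_len) {
			if (found < max_runs) {
				out[found].start = start;
				out[found].len = i - start;
			}
			found++;
		}
	}
	return found;
}
```

With min_len set to 256, this would flag a hole like the 256-byte one seen in the zip archive.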


In other news, just now, after a clean reboot to use a new kernel, the
system came up with

[  1885.434544] panic: ffs_blkfree: bad size: dev = 0xa803, bno = 331526 bsize = 32768, size = 12288, fs = /usr/obj

(different filesystem & disk)


Cheers,

Patrick


Re: file system corruption

2020-10-14 Thread Thomas Klausner
On Mon, Oct 12, 2020 at 06:39:48AM +0200, Martin Husemann wrote:
> Do you know the file offset where the corruption started?

I don't have that one any more, but I found a different one.

In this case, the range of bytes from 1291124737-1291157504 (32768
bytes) is zeroed out. 1291124737 = 0x4CF50001, 1291157504 =
0x4CF58000. It's on NFS.
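
The byte range above is worth checking for alignment. Assuming the numbers are 1-based byte positions, as cmp -l prints them (the message does not say which tool reported them), the first damaged byte is at 0-based offset 1291124736 = 0x4CF50000 = 39402 * 32768, and 1291157504 - 1291124737 + 1 = 32768, so the hole would be exactly one 32 KiB chunk on a 32 KiB boundary. A small check:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* cmp -l numbers bytes from 1; file offsets start at 0. */
uint64_t
one_based_to_offset(uint64_t byte_no)
{
	assert(byte_no >= 1);
	return byte_no - 1;
}

/* Number of bytes in an inclusive range of byte numbers. */
uint64_t
range_len(uint64_t first, uint64_t last)
{
	assert(last >= first);
	return last - first + 1;
}

/* True if off is a multiple of align (align must be non-zero). */
bool
is_aligned(uint64_t off, uint64_t align)
{
	return off % align == 0;
}
```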

> Can you show "dumpfs $rawdev | head -15" for that file system?

One was NFS where I can't get this to work.

The other is

dumpfs /disk/storage_202008 | head -15
file system: /dev/rdk5
format  FFSv2
endian  little-endian
location 65536  (-b 128)
magic   19540119        time    Tue Oct 13 22:11:31 2020
superblock location     65536   id      [ 5f2bfb64 6b0a718f ]
cylgrp  dynamic inodes  FFSv2   sblock  FFSv2   fslevel 5
nbfree  123342287       ndir    4136    nifree  120190114       nffree  14975
ncg     17107   size    3856137728      blocks  3848336845
bsize   32768   shift   15      mask    0x8000
fsize   4096    shift   12      mask    0xf000
frag    8       shift   3       fsbtodb 3
bpg     28177   fpg     225416  ipg     7040
minfree 1%      optim   space   maxcontig 2     maxbpg  4096
symlinklen 120  contigsumsize 2

(256 bytes seemed small to me for a file system issue - one more
reason to think it might be UVM related.)
 Thomas


Re: file system corruption

2020-10-12 Thread Chuck Silvers
On Sun, Oct 11, 2020 at 11:19:16PM +0200, Thomas Klausner wrote:
> Hi!
> 
> I've recently updated from 9.99.73 from Sep 17 to one of Oct 5.
> 
> I've had serious file system corruption. Mostly in mercurial and
> sqlite3 databases, but also in normal files.

what platform is this on?


> Some of the file systems where this happened are NFS-served from Linux,
> but I also saw it on a local FFSv2.

I would ask what the fs block size is, but with NFS there isn't
any fs block size involved, so I doubt it matters for the
local file systems either.


> 2c2
> <  $NetBSD: uvm_amap.c,v 1.123 2020/08/18 10:40:20 chs Exp $
> ---
> >  $NetBSD: uvm_amap.c,v 1.125 2020/09/21 18:41:59 chs Exp $
> 
> 11c11
> <  $NetBSD: uvm_io.c,v 1.28 2016/05/25 17:43:58 christos Exp $
> ---
> >  $NetBSD: uvm_io.c,v 1.29 2020/09/21 18:41:59 chs Exp $

the above changes go together, and they were long enough ago that
if they were the cause of the corruption then it seems likely
that someone else would have reported it before you.
also, these changes are about process address space manipulation
and not file systems, so if this were the problem then you would
be getting non-file-system symptoms too.


> 5c5
> <  $NetBSD: uvm_bio.c,v 1.121 2020/07/09 09:24:32 rin Exp $
> ---
> >  $NetBSD: uvm_bio.c,v 1.122 2020/10/05 04:48:23 rin Exp $

this change is somewhat more recent and specifically about
file systems, so this seems more likely.

could you try testing with each of the above sets of changes separately
backed out, to see if you can narrow it down to one change?
if the problem is not due to either of those sets of changes
then your best bet is to bisect to find the change that introduced
the problem.
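
For the bisection suggested here, the search itself is a plain binary search over builds ordered by date. A sketch, assuming build 0 is known good and the last build is known bad; the bad[] array stands in for the result of building, booting and running the workload on each build (a hypothetical harness), and only the entries actually probed matter. With intermittent corruption, a "good" verdict is only as trustworthy as the length of the test run, which is the main practical caveat.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Builds are ordered oldest to newest; bad[i] is the (hypothetical)
 * test outcome for build i.  Returns the index of the first bad build.
 */
size_t
first_bad_build(const bool *bad, size_t n)
{
	assert(n >= 2 && !bad[0] && bad[n - 1]);
	size_t good = 0, badi = n - 1;

	/* Invariant: build `good` tested good, build `badi` tested bad. */
	while (badi - good > 1) {
		size_t mid = good + (badi - good) / 2;

		if (bad[mid])
			badi = mid;
		else
			good = mid;
	}
	return badi;
}
```

Each iteration halves the range, so roughly log2(n) rebuild-and-test cycles are needed.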

I tried to check the automated test results to see if those are showing
any problems that look related, but that web server is down right now.


> Anyone else having problems?
> 
> Any ideas?
>  Thomas

-Chuck


Re: file system corruption

2020-10-11 Thread Martin Husemann
On Sun, Oct 11, 2020 at 11:19:16PM +0200, Thomas Klausner wrote:
> I don't know enough about the internals of the hg and sqlite3, but I
> also saw a broken zip archive and had a good copy for comparison. In
> that case, a block of 256 bytes was zero instead of the real data.

Do you know the file offset where the corruption started?
Can you show "dumpfs $rawdev | head -15" for that file system?

Martin