@Tom Lane: This is what Rick Macklem (NFS dev on FreeBSD) has to say about my issue.

-------- Original Message --------
Subject: Re: A new look at old NFS readdir() problems?
Date: 01/02/2025 10:08 am
From: Rick Macklem <rick.mack...@gmail.com>
To: Thomas Munro <tmu...@freebsd.org>
Cc: Rick Macklem <rmack...@freebsd.org>, Larry Rosenman <l...@lerctr.org>

On Thu, Jan 2, 2025 at 2:50 AM Thomas Munro <tmu...@freebsd.org> wrote:

Hi Rick
CC: ler

I hope you don't mind me reaching out directly; I just didn't really
want to spam existing bug reports without sufficient understanding to
actually help yet... but I figured I should get in touch and see if
you have any clues or words of warning, since you've worked on so much
of the NFS code.  I'm a minor FBSD contributor and interested in file
systems, but not knowledgeable about NFS; I run into/debug/report a
lot of file system bugs on a lot of systems in my day job on
databases.  I'm interested to see if I can help with this problem.
Existing ancient report and interesting email:

https://lists.freebsd.org/pipermail/freebsd-fs/2014-October/020155.html
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=57696

What we ran into is not the "bad cookie" state, which doesn't really
seem to be recoverable in general, from what I understand (though the
FreeBSD code apparently would try, huh).  It's a simple case where the
NFS client requests a whole directory with a large READDIR request,
and then tries to unlink all the files in a traditional
while-readdir()-unlink() loop that works on other systems.
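
(For concreteness, a minimal sketch of the pattern in question; the
function name is made up and this is not the actual application code:)

#include <dirent.h>
#include <string.h>
#include <unistd.h>

/*
 * Unlink every entry while the same readdir() scan is still in
 * progress.  Over NFS this relies on the directory offset cookies of
 * later entries staying valid after earlier entries are removed.
 */
void
remove_all_naive(const char *path)
{
        DIR *dir = opendir(path);
        struct dirent *dp;

        if (dir == NULL)
                return;
        while ((dp = readdir(dir)) != NULL) {
                if (strcmp(dp->d_name, ".") != 0 &&
                    strcmp(dp->d_name, "..") != 0)
                        unlinkat(dirfd(dir), dp->d_name, 0);
        }
        closedir(dir);
}
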
In general, NFS is not a POSIX-compliant file system, due to its protocol
design. The above is one example. The only "safe" way is to opendir() or
rewinddir() after every removal.

The above usually works (and always worked for UFS long ago) because
the directory offset cookies for the entries that follow the unlinked
one happened to "still be valid". That is no longer true for FreeBSD's
UFS, nor for many other file systems that can be exported.

If the client reads the entire directory in one READDIR, then it is fine,
since it has no need for the directory offset cookies. However, there is
a limit to how much a single READDIR can return (these days, for
NFSv4.1/4.2, it could be raised to just over 1Mbyte, but FreeBSD limits
it to 8K at the moment).

Another way to work around the problem is to read the entire directory
into the client via READDIRs before starting to do the unlinks.
The opendir()/readdir() code in libc could be hacked to do that,
but I have never tried to push such a patch into FreeBSD.
(It would be limited by how much memory can be malloc()'d, which is
pretty generous compared to even large directories with tens of
thousands of entries.)
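
For illustration only, here is a minimal user-level sketch of that
workaround (the helper name is made up): scandir(3) reads the whole
directory into a malloc()'d array, so the unlinks only start after the
last READDIR has been answered.

#include <dirent.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Snapshot the whole directory first, then unlink: every READDIR has
 * already completed before the first entry disappears, so no stale
 * directory offset cookie is ever replayed to the server.
 */
int
remove_all_snapshot(const char *path)
{
        struct dirent **names;
        char entry[PATH_MAX];
        int n, i;

        n = scandir(path, &names, NULL, alphasort);
        if (n < 0)
                return -1;
        for (i = 0; i < n; i++) {
                if (strcmp(names[i]->d_name, ".") != 0 &&
                    strcmp(names[i]->d_name, "..") != 0) {
                        snprintf(entry, sizeof(entry), "%s/%s",
                            path, names[i]->d_name);
                        unlink(entry);
                }
                free(names[i]);
        }
        free(names);
        return 0;
}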

The above is true for all versions of NFS up to NFSv4.2, which is
the current one. Unless some future version of NFS does READDIR
differently (I won't live long enough to see this ;-), it will always
be the case.

If my comment above was not clear, the following code is the "safe"
way to remove all entries in a directory.

DIR *dir;
struct dirent *dp;

do {
        dir = opendir("X");
        dp = readdir(dir);
        /* skip "." and "..", which unlink() cannot remove */
        while (dp != NULL && (strcmp(dp->d_name, ".") == 0 ||
            strcmp(dp->d_name, "..") == 0))
                dp = readdir(dir);
        if (dp != NULL)
                unlink(dp->d_name);     /* assumes the cwd is "X" */
        closedir(dir);
} while (dp != NULL);

In theory, the directory_offset_cookie was supposed to handle this, but it
has never worked correctly, for a couple of reasons.
1 - RFC1813 (the NFSv3 one) did not describe the cookie verifier correctly.
    It should only change when cookies for extant entries change. The
    description suggested it should change whenever an entry is deleted,
    since that cookie is no longer valid.
2 - #1 only works if directory offset cookies for other entries in the
    directory do not change when an entry is deleted. This used to be the
    case for UFS, but was broken in FreeBSD when a commit many years ago
    optimized ufs_readdir() to compress out invalid entries. Doing this
    changes the directory offset cookies every time an entry is deleted
    at the beginning of a directory block.

rick

On FreeBSD
it seems to clobber its own directory cache, make extra unnecessary
READDIR requests, and skip some of the files.  Or maybe I have no idea
what's going on and this is a hopelessly naive question and mission
:-)

Here's what we learned so far starting from Larry's report:

https://www.postgresql.org/message-id/flat/04f95c3c13d4a9db87b3ac082a9f4877%40lerctr.org

Note that this issue has nothing to do with "bad cookie" errors (I
doubt the server I'm talking to even implements that -- instead it
tries to have cookies that are persistent/stable).

Also, while looking into this and initially suspecting cookie
stability bugs (incorrectly), I checked a bunch of local file systems
to understand how their cookies work, and I think I found a related
problem when FreeBSD exports UFS, too.  I didn't repro this with NFS
but it's clearly visible from d_off locally with certain touch, rm
sequences.  First, let me state what I think the cookie should be
trying to achieve, on a system that doesn't implement "bad cookie" but
instead wants cookies that are persistent/always valid:  if you make a
series of READDIR requests using the cookie from the final entry of
the previous response, it should be impossible to miss any entry that
existed before your first call to readdir(), and impossible to see any
entry twice.  It is left undefined whether entries created after that
time are visible, since anything else would require unbounded time or
space via locks or multi-version magic (= isolation problems from
database-land).

Going back to the early 80s, Sun UFS looks good (based on illumos
source code) because it doesn't seem to move entries after they are
created.  That must have been the only file system when they invented
VFS and NFS.  Various other systems since have been either complex but
apparently good (ZFS/ZAP cursors can tolerate up to 2^16 hash
collisions which I think we can call statistically impossible, XFS
claims to be completely stable though I didn't understand fully why,
BTRFS assigns incrementing numbers that will hopefully not wrap, ...),
or nearly-good-enough-but-ugh (ext4 uses hashes like ZFS but
apparently fails with ELOOP on hash collisions?).  I was expecting
FreeBSD UFS to be like Sun UFS but I don't think it is!  In the UFS
code since at least 4.3BSD (but apparently not in the Sun version,
forked before or removed later?), inserting a new entry can compact a
directory page, which moves the offset of a directory entry lower.
AFAICS we can't move an entry lower, or we risk skipping it in NFS
readdir(), and we can't move it higher, or we risk double-reporting it
in readdir().  Or am I missing something?

Thanks for reading and happy new year,

Thomas Munro

--
Larry Rosenman                     http://www.lerctr.org/~ler
Phone: +1 214-642-9640                 E-Mail: l...@lerctr.org
US Mail: 13425 Ranch Road 620 N, Apt 718, Austin, TX 78717-1010

