On Tue, 12 May 2026 at 21:22 -0400, Nathan Myers wrote:
[This patch has only the XML text; the usual scripts have
not been run on it yet.]

Please do that, which will find the invalid XML in the patch.
https://gcc.gnu.org/onlinedocs/libstdc++/manual/documentation_hacking.html#docbook.rules

ISO C++ leaves concurrent use of std::filesystem operations
undefined. POSIX provides definitions for concurrent use. Here
we document POSIX (and sometimes GNU extension) operations used
to implement std::filesystem ops so that when run with libstdc++
hosted on POSIX, they are defined by the union of ISO C++ and
POSIX.

libstdc++/Changelog
        * doc/xml/manual/io.xml: Document POSIX usage in filesystem.
---
libstdc++-v3/doc/xml/manual/io.xml | 285 +++++++++++++++++++++++++++++
1 file changed, 285 insertions(+)

diff --git a/libstdc++-v3/doc/xml/manual/io.xml 
b/libstdc++-v3/doc/xml/manual/io.xml
index adc37cd8488..9be34ac7526 100644
--- a/libstdc++-v3/doc/xml/manual/io.xml
+++ b/libstdc++-v3/doc/xml/manual/io.xml
@@ -669,4 +669,289 @@
  </section>
</section>

+<!-- Sect1 05 : Filesystem Operations -->
+<section xml:id="std.io.fs" xreflabel="Filesystem Operations"><info><title>Filesystem 
Operations</title></info>
+<?dbhtml filename="io_fs_intro.html"?>
+  <section xml:id="std.io.fs.intro" 
xreflabel="Introduction"><info><title>Introduction</title></info>
+    <para>
+      The ISO C++ Standard Library defines operations on a global,
+      shared filesystem that, called in parallel, may race.

I would prefer to say "potentially concurrent" not "in parallel".

This seems to imply that the problem is only for concurrent uses of
std::filesystem within a process, but it applies to all access to the
file system, including from other processes (which might not be using
std::filesystem at all). The text below says "another program" but I
think it should be mentioned here too.

+      Section <emphasis>[fs.race.behavior]</emphasis> of the
+      Standard identifies such races as <emphasis>undefined
+      behavior</emphasis>, "UB", so the behavior of programs
+      that do such filesystem operations in parallel may be
+      left undefined by the C++ Standard.
+      As Standard implementers are free to define what the Standard
+      does not, rehabilitating otherwise undefined programs on their
+      platform, this section provides definitions for many such
+      cases by relating Standard Library <type>filesystem</type>
+      operations implemented in libstdc++ to
+      corresponding ISO POSIX.1-2008 Standard or, in some cases,

This should be either ISO/IEC/IEEE 9945:2009 or just POSIX. I see no
reason to limit it to the 2008 standard; there have been revisions to
POSIX since then.

What about Windows? This should at least admit it exists, even if
defining anything is just punted by saying this document only applies
to POSIX-based operating systems.

But because the privilege escalation bug in filesystem::remove_all
(CVE-2022-21658) is fixed in libstdc++ *except* on Windows, I don't
think we should completely omit any discussion of Windows.

+      GNU operations that have well-defined (if possibly surprising)
+      semantics under race conditions.
+    </para>
+    <para>
+      We begin by identifying C++ Standard objects with POSIX objects.
+      Trivially, an ISO C++ <emphasis>file</emphasis> is exactly an
+      ISO POSIX
+      <emphasis>file</emphasis>, and likewise for
+      <emphasis>directory</emphasis>,
+      <emphasis>link</emphasis>,
+      <emphasis>hard link</emphasis>, and
+      <emphasis>symbolic link</emphasis>.
+      All of these are identified by names represented in C++
+      with a string-like type <type>filesystem::path</type> defined
+      in <emphasis>[fs.class.path]</emphasis>.
+      The contents of <type>path</type> are identical to the
+      strings passed to and delivered from POSIX.
+    </para>
+    <para>
+      The C++ Standard defines file system operations via functions
+      that do not necessarily map one-to-one to those defined in
+      POSIX or GNU, taking arguments that do not necessarily match
+      those that must be passed to underlying library functions.
+      This section defines those mappings so that the effects of
+      <type>filesystem</type> operations may be deduced from those
+      of the POSIX operations in places the C++ Standard may leave
+      undefined.
+      POSIX defines the effect of some races as
+      <emphasis>unspecified</emphasis>, with a set of possible
+      outcomes;
+      corresponding operations in libstdc++ will
+      have the same set of possible outcomes.


I think it could help to introduce a concept similar to the standard's
"modification order" for atomic objects. The OS serializes changes to
the file system, so although those changes are not exposed directly to
the C++ abstract machine, there *is* a modification order, and so any
std::filesystem operation which consists of a single call to a POSIX
function cannot race. That operation will see *some* valid value
corresponding to an atomic state of the filesystem at some point in
time.

Maybe we should introduce some kind of safety annotations similar to
the libc ones described by Alex at
https://developers.redhat.com/blog/2014/09/10/multi-thread-async-signal-and-async-cancel-safety-docs-in-gnu-libc#safety_annotations
and documented at 
https://sourceware.org/glibc/manual/latest/html_node/POSIX-Safety-Concepts.html

e.g. we could have FS-Atomic (or a better name) meaning a call just
performs a single FS operation which gives you a valid result for some
instant in time, and has no undefined behaviour even in the face of
concurrent modifications of any portion of the pathname.

The file system might change immediately after the call, but there is
no undefined behaviour in the call itself even if another process is
performing operations on the same pathname.

If the program relies on the result still being true, the program
might introduce a misbehaviour, but it's not UB in the std::filesystem
call.

Maybe also introduce the concept of "FS race", for changes to the
state of the FS which invalidate some earlier result. Draw a
distinction between that and a "data race" in the standard, which is
always UB. If you do:

 if (std::filesystem::is_directory(p))
   for (auto e : std::filesystem::directory_iterator(p))
     frob(e);

The directory_iterator constructor might throw an exception if `p` is
removed or replaced with a regular file after the `is_directory` call.
That file system race is not a "data race", and is not UB.
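
A minimal sketch of code that tolerates that FS race, assuming the
caller wants an error code rather than an exception. The helper name
count_entries is hypothetical; it skips the is_directory pre-check and
lets the directory_iterator constructor and increment report whatever
the file system looks like at the instant each call is made:

```cpp
#include <cassert>
#include <cstddef>
#include <filesystem>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical helper (not part of libstdc++): counts the entries of `p`.
// On any error, including `p` vanishing mid-iteration, sets `ec` and
// returns 0 instead of throwing.
std::size_t count_entries(const fs::path& p, std::error_code& ec)
{
  std::size_t n = 0;
  // No is_directory() pre-check: the constructor itself reports the
  // error if `p` is missing or is not a directory.
  fs::directory_iterator it(p, ec);
  if (ec)
    return 0;
  while (it != fs::directory_iterator())
    {
      ++n;
      it.increment(ec);
      if (ec)
        return 0;
    }
  return n;
}
```

The count can still be stale by the time the caller looks at it, which
is an FS race in the caller's logic, not UB in any of the calls.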

And maybe a separate concept for intra-process races that only depend
on the current process, e.g. getcwd isn't affected by changes to the
file system, only by process state. If your process isn't changing CWD
in another thread, then filesystem::current_path() is never racy. So
maybe Process-Atomic, in contrast to FS-Atomic?


+    </para>
+    <para>
+      <type>std::filesystem</type> operations on relative paths
+      rely on the process-global "current working directory"
+      which may be changed at any time by another thread.

So most ops which are not FS-Atomic can misbehave if another thread in
the same process changes the CWD between e.g. a call to `lstat` with a
relative path followed by an `open` with the same relative path,
because the relative path might be resolved differently.
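
A rough POSIX sketch of one way to take the CWD out of the picture,
assuming openat and fstatat are available. The helper name
stat_then_open is hypothetical and this is not what libstdc++ does for
every operation. The starting directory is opened once, and both
lookups are resolved relative to that descriptor, so a chdir in
another thread cannot make them refer to different directories:

```cpp
#include <cassert>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical helper: lstat-like check followed by open, both resolved
// against the directory that was the CWD when the helper was entered.
// Returns an open descriptor, or -1 on failure.
int stat_then_open(const char* name, struct stat* st)
{
  assert(name != nullptr && st != nullptr);
  int dirfd = ::open(".", O_RDONLY | O_DIRECTORY | O_CLOEXEC);
  if (dirfd < 0)
    return -1;
  int fd = -1;
  // AT_SYMLINK_NOFOLLOW makes fstatat behave like lstat.
  if (::fstatat(dirfd, name, st, AT_SYMLINK_NOFOLLOW) == 0)
    fd = ::openat(dirfd, name, O_RDONLY | O_CLOEXEC);
  ::close(dirfd);
  return fd;
}
```

The entry can still be replaced between the two calls (an FS race),
but the two calls can no longer disagree about which directory the
relative path starts from.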

+      Furthermore, the contents of a filesystem may be changed
+      at any time by operations in another thread or another
+      program.
+      Operations that involve checking the current contents
+      of the filesystem and choosing subsequent operations
+      according to the results may have surprising results
+      if the filesystem contents are changed in between,
+      often with serious consequences to security.
+    </para>
+    <para>
+      In other words, saying "results of concurrent filesystem
+      operations are well-defined" does not mean that they will
+      necessarily be what was intended, and it is very easy to
+      introduce security vulnerabilities if extreme care is not
+      taken.
+    </para>
+  </section>
+  <section xml:id="std.io.fs.dir_iter" xreflabel="Directory 
Iteration"><info><title>Directory Iteration</title></info>
+    <para>
+      This section addresses iteration through filesystem directories,
+      as defined in <emphasis>[fs.class.directory.iterator]</emphasis>
+      and <emphasis>[fs.class.rec.dir.itr]</emphasis> in the Standard,
+      both recursive and not.
+    </para>
+    <para>
+      Constructing a
+      <function>filesystem::directory_iterator</function> or
+      <function>recursive_directory_iterator</function> on a directory
+      <function>path</function> uses POSIX
+      <function>openat</function>, <function>fdopendir</function>, and
+      <function>fstat</function>.

Nothing uses fstat. This appears several times below, I think it
should be just 'stat'.

+    </para>
+    <para>
+      Stepping into a subdirectory uses the same calls.
+    </para>
+    <para>
+      Incrementing an iterator uses POSIX <function>readdir</function> for
+      each entry.
+    </para>
+    <para>
+      Using any of the <function>directory_entry</function>
+      <emphasis>observer</emphasis> members, other than querying
+      <function>path</function>, <function>exists</function> or 
<function>is_</function>...
+      triggers an <function>fstat</function> operation on the first
+      such call.
+    </para>
+    <para>
+      Destroying one uses POSIX <function>close</close>

Mismatched XML end tag.

I think it's important to note that for modern POSIX systems,
directory iterators do *not* depend on the current working directory,
except on construction if using a relative path. When openat is
supported (POSIX.1-2008, or since Linux 2.6.16 && glibc 2.4) the
directory iterator opens a file descriptor for the directory, and then
all subsequent operations are done relative to the descriptor.

This makes directory iterators safe against many file system races,
and is exactly the kind of thing I think should be explained, rather
than just listing the POSIX APIs used.

If the contents of a directory are changed while iterating, you might
get errors reported (as std::filesystem::filesystem_error exceptions
or std::error_code values) or see inconsistent "surprising" results,
but I don't think there will be any UB here.

For Windows and systems without POSIX.1-2008 support, the directory
iterators will read directory entries and recurse into sub-directories
using relative paths, which can introduce TOCTTOU races during
traversal with the directory iterator. There's nothing we can do
about that for pre-2008 POSIX systems, and it would require
reimplementation in terms of Windows APIs to solve it for Windows (I
think Boost.Filesystem has done it, I am not going to try to do that
myself).
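
To illustrate the descriptor-relative approach, here is a simplified
sketch using openat, fdopendir and readdir. The helper name list_at is
hypothetical and this is not the libstdc++ code, only the general
shape: once the parent descriptor is open, stepping into a child does
not consult the CWD or re-resolve the parent's pathname.

```cpp
#include <cassert>
#include <dirent.h>
#include <fcntl.h>
#include <string>
#include <unistd.h>
#include <vector>

// Hypothetical helper: names in directory `name`, resolved relative to
// the already-open directory `parent_fd` (or AT_FDCWD).
// Returns an empty vector if the directory cannot be opened.
std::vector<std::string> list_at(int parent_fd, const char* name)
{
  assert(name != nullptr);
  std::vector<std::string> out;
  // O_NOFOLLOW: fail rather than follow a symlink swapped in for `name`.
  int fd = ::openat(parent_fd, name,
                    O_RDONLY | O_DIRECTORY | O_NOFOLLOW | O_CLOEXEC);
  if (fd < 0)
    return out;
  DIR* d = ::fdopendir(fd);
  if (d == nullptr)
    {
      ::close(fd);
      return out;
    }
  while (dirent* e = ::readdir(d))
    {
      std::string n = e->d_name;
      if (n != "." && n != "..")
        out.push_back(n);
    }
  ::closedir(d); // also closes fd
  return out;
}
```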

+    </para>
+  </section>
+  <section xml:id="std.io.fs.ops" xreflabel="Filesystem 
Operations"><info><title>Filesystem Operations</title></info>
+    <para>
+      Most subsections here correspond to a subheading under
+      <emphasis>[fs.ops.funcs]</emphasis> in the Standard, defining
+      operations implemented by calling POSIX filesystem functions.
+      The first two gather collections of such subheadings to share
+      a common description.
+    </para>
+    <section xml:id="std.io.fs.ops.lstats" xreflabel="Status Checks"><info><title>Status 
Checks</title></info>
+      <para>
+       All of
+       <function>filesystem::is_regular_file</function>,
+       <function>is_directory</function>,
+       <function>is_character_file</function>,
+       <function>is_block_file</function>
+       <function>is_fifo</function>,
+       <function>is_symlink</function>,
+       <function>is_socket</function>,
+       <function>file_size</function>,
+       <function>hard_link_count</function>,
+       <function>is_empty</function>,
+       <function>last_write_time</function>
+       <function>is_other</function>,
+       <function>status</function>, and
+       <function>symlink_status</function>
+       use POSIX <function>lstat</function>.

And so these cannot race. There's a single call to lstat, so you get a
result that is always correct (for the atomic instant that the OS
evaluates lstat). FS-Atomic, to use the poorly-named property
discussed above.

+      </para>
+    </section>
+    <section xml:id="std.io.fs.ops.paths" xreflabel="Filesystem Path 
Checks"><info><title>Filesystem Path Checks</title></info>
+      <para>
+       All of
+       <function>filesystem::canonical</function>,

This first turns any relative path into an absolute one
using filesystem::absolute (which doesn't seem to be documented here?
It depends on filesystem::current_path() for POSIX, and
GetFullPathNameW for Windows). So there's a possible FS race for an
absolute path.

Then it uses POSIX realpath on the absolute path, if that's available.
For a relative path, there's a possible FS race between forming the
absolute path and then canonicalizing it with realpath, but apart from
that the behaviour is usually entirely specified by POSIX realpath.
The exceptions are for systems without realpath (including Windows) or
when the absolute path is longer than PATH_MAX. In that case, there's
a loop over every component of the absolute path which uses
is_directory and is_symlink+read_symlink, which is super-racy.

+       <function>proximate</functiofunction>,

Mismatched end tag.

This is fully specified in terms of other std::filesystem functions,
so its behaviour depends entirely on those.

+       <function>relative</function>, and

Ditto.

+       <function>weakly_canonical</function> use POSIX

Super racy. TOCTTOU race between calling status and canonical, which
is done for each component of the path. For the non-existent
components it's a purely "lexical" operation, so only operates on the
string contents of the fs::path, without touching the file system.

Maybe introduce a concept like "Racy-Loop" for these cases!

Or a more general concept for "composite operation" meaning that it
strings several steps together and so has inherent TOCTTOU races ...
although I guess that's implied by "not FS-Atomic".

I think it makes sense to describe weakly_canonical in terms of calls
to status and canonical, it doesn't directly call stat, lstat, or
readlink.
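
As an illustration of that composite shape, here is a sketch written
only in terms of status, canonical and lexical operations. It follows
the standard's description rather than the libstdc++ source, and the
name weakly_canonical_sketch is hypothetical. Every gap between steps
is a TOCTTOU window:

```cpp
#include <cassert>
#include <filesystem>
#include <system_error>
#include <utility>

namespace fs = std::filesystem;

// Hypothetical sketch, not the libstdc++ implementation.
fs::path weakly_canonical_sketch(const fs::path& p, std::error_code& ec)
{
  fs::path head;
  auto it = p.begin();
  // Step 1: one status() call per component until one does not exist.
  for (; it != p.end(); ++it)
    {
      fs::path next = head / *it;
      if (!fs::exists(fs::status(next, ec)))
        break;
      head = std::move(next);
    }
  ec.clear(); // "not found" is the expected way to leave the loop

  // Step 2: canonicalize the prefix that existed a moment ago.
  fs::path result;
  if (!head.empty())
    {
      result = fs::canonical(head, ec);
      if (ec)
        return fs::path();
    }

  // Step 3: purely lexical, never touches the file system.
  for (; it != p.end(); ++it)
    result /= *it;
  return result.lexically_normal();
}
```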

+       <function>realpath</function>, and may call
+       <function>fstat</function>, <function>lstat</function>, and
+       <function>readlink</function> on the argument
+       and its parent directories.
+      </para>
+    </section>
+    <section xml:id="std.io.fs.ops.copy" 
xreflabel="Copy"><info><title>Copy</title></info>
+      <para>
+       <function>filesystem::copy</function> uses POSIX

This is almost entirely specified in the standard as a series of
operations using other std::filesystem calls. The only exceptions are
"create a symbolic link" and "create a hard link". I think it makes
sense to just say it performs the steps shown in the standard, which
is inherently racy because things can change between the initial calls
to symlink_status/status and the subsequent copying/creating files.

There should be no UB at any stage, but plenty of opportunity for
TOCTTOU problems if the FS is changing.

+       <function>fstat</function> and possibly
+       <function>readlink</function>, and
+       <function>symlink</function>, <function>link</function>,
+       <function>mkdir</function>, or operations listed under
+       <function>filesystem::copy_file</function>,
+       possibly repeatedly if the argument is a directory.
+      </para>
+    </section>
+    <section xml:id="std.io.fs.ops.copy_file" xreflabel="Copy File"><info><title>Copy 
File</title></info>
+      <para>
+       <function>filesystem::copy_file</function> uses POSIX
+       <function>lstat</function>, <function>open</function> on both
+       source and destination, and GNU
+       <function>copy_file_range</function> or

This is a Linux function, not GNU.

+       <function>send_file</function>

Ditto (and it's sendfile without the underscore).

+       where available and applicable, or else POSIX
+       <function>read</function> and
+       <function>write</function> repeatedly, and then
+       <function>close</function> on both.
+      </para>
+    </section>
+    <section xml:id="std.io.fs.ops.copy_symlink" xreflabel="Copy 
Symlink"><info><title>Copy Symlink</title></info>
+      <para>
+       <function>filesystem::copy_symlink</function> uses POSIX
+       <function>lstat</function> and <function>symlink</function>.
+      </para>
+    </section>
+    <section xml:id="std.io.fs.ops.mkdirs" xreflabel="Create 
Directories"><info><title>Create Directories</title></info>
+      <para>
+       <function>filesystem::create_directories</function> uses POSIX
+       <function>fstat</function>, and then <function>mkdir</function>, 
possibly
+       repeatedly.
+      </para>
+    </section>
+    <section xml:id="std.io.fs.ops.mkdir" xreflabel="Create Directory"><info><title>Create 
Directory</title></info>
+      <para>
+       <function>filesystem::create_directory</function> uses POSIX
+       <function>mkdir</function>.

The overloads taking one path are FS-Atomic, but the overloads taking
two paths use stat to get the permissions from an existing file, then
use mkdir. That's not atomic, but neither step can have any UB.
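
Roughly, and only as a sketch (the helper name mkdir_like is
hypothetical, and the real function also handles the case where the
directory already exists), the two-path overload has this shape:

```cpp
#include <cassert>
#include <cerrno>
#include <sys/stat.h>

// Hypothetical sketch: create `p` with the permission bits of `existing`.
// Returns true if a directory was created; otherwise false with errno set.
bool mkdir_like(const char* p, const char* existing)
{
  assert(p != nullptr && existing != nullptr);
  struct stat st;
  if (::stat(existing, &st) != 0)   // step 1: FS-Atomic on its own
    return false;
  // `existing` may be changed or removed here; the mode we pass below
  // is a snapshot, which is stale but still well-defined.
  return ::mkdir(p, st.st_mode & 07777) == 0;  // step 2: FS-Atomic on its own
}
```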

+      </para>
+    </section>
+    <section xml:id="std.io.fs.ops.mkdirsymlink" xreflabel="Create Directory 
Symlink"><info><title>Create Directory Symlink</title></info>
+      <para>
+       <function>filesystem::create_directory_symlink</function> uses POSIX
+       <function>symlink</function>.

I think it would be simpler to say that for POSIX it is just a call to
create_symlink. It doesn't use symlink directly, it just calls another
function which is documented below.

(Windows symlink support should be arriving in GCC 17, I have a patch
to review.)

+      </para>
+    </section>
+    <section xml:id="std.io.fs.ops.mklink" xreflabel="Create Hard 
Link"><info><title>Create Hard Link</title></info>
+      <para>
+       <function>filesystem::create_hard_link</function> uses POSIX
+       <function>link</function>.

FS-Atomic

+      </para>
+    </section>
+    <section xml:id="std.io.fs.ops.mksymlink" xreflabel="Create 
Symlink"><info><title>Create Symlink</title></info>
+      <para>
+       <function>filesystem::create_symlink</function> uses POSIX
+       <function>symlink</function>.
+      </para>
+    </section>
+    <section xml:id="std.io.fs.ops.cwd" xreflabel="Current Path"><info><title>Current 
Path</title></info>
+      <para>
+       <function>filesystem::current_path</function> uses POSIX
+       <function>getcwd</function>. This accesses process global
+       state, which may be changed at any time by another
+       thread.

But is still FS-Atomic. You only get problems when using the path it
returns for subsequent operations, and those problems are not
inherently UB, they're just racy.

+      </para>
+    </section>
+    <section xml:id="std.io.fs.ops.equiv" 
xreflabel="Equivalent"><info><title>Equivalent</title></info>
+      <para>
+       <function>filesystem::equivalent</function> uses a sequence
+       of calls to POSIX
+       <function>lstat</function> and, possibly,

stat not lstat

+       <function>readlink</function>.

I don't think this is correct.

+      </para>
+    </section>
+    <section xml:id="std.io.fs.ops.perms" 
xreflabel="Permissions"><info><title>Permissions</title></info>
+      <para>
+       <function>filesystem::permissions</function> uses POSIX
+       <function>lstat</function>, and <function>fchmodat</function>.

It uses either stat or lstat, depending on the arguments.

If fchmodat is available it uses that, otherwise, it uses chmod.

+      </para>
+    </section>
+    <section xml:id="std.io.fs.ops.rdsym" xreflabel="Read Symlink"><info><title>Read 
Symlink</title></info>
+      <para>
+       <function>filesystem::read_symlink</function> uses POSIX
+       <function>lstat</function> and <function>readlink</function>.

We should probably change fs::read_symlink to attempt a single call to
readlink with a "reasonable" buffer size, and only use lstat if
readlink returns bufsiz (meaning truncation might have occurred) and
we need to resize the buffer.

For reasonably sized paths it could be FS-Atomic.
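
Something along these lines, as an untested-in-libstdc++ sketch. The
helper name read_link and the initial size of 256 are arbitrary
assumptions, and for simplicity it doubles the buffer on possible
truncation instead of calling lstat to find the size:

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <string>
#include <unistd.h>
#include <utility>

// Hypothetical sketch: read the target of symlink `p` into `target`.
// Returns false with errno set on failure (e.g. EINVAL if `p` is not
// a symlink).
bool read_link(const char* p, std::string& target)
{
  assert(p != nullptr);
  std::string buf(256, '\0'); // assumed "reasonable" initial size
  for (;;)
    {
      ssize_t n = ::readlink(p, &buf[0], buf.size());
      if (n < 0)
        return false;
      if (static_cast<std::size_t>(n) < buf.size())
        {
          // Common case: a single readlink call, nothing else.
          buf.resize(static_cast<std::size_t>(n));
          target = std::move(buf);
          return true;
        }
      // n == buf.size(): the result may have been truncated, so retry.
      buf.resize(buf.size() * 2);
    }
}
```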

+      </para>
+    </section>
+    <section xml:id="std.io.fs.ops.rm" 
xreflabel="Remove"><info><title>Remove</title></info>
+      <para>
+       <function>filesystem::remove</function> uses POSIX
+       <function>remove</function>.

FS-Atomic for POSIX, not for Windows.

+      </para>
+    </section>
+    <section xml:id="std.io.fs.ops.rmall" xreflabel="Remove All"><info><title>Remove 
All</title></info>
+      TODO

This was the original motivation for writing these docs :-)

CVE-2022-21658 for Rust's equivalent is relevant here.

+    </section>
+    <section xml:id="std.io.fs.ops.mv" 
xreflabel="Rename"><info><title>Rename</title></info>
+      <para>
+       <function>filesystem::rename</function> uses POSIX
+       <function>rename</function>.

FS-Atomic for POSIX, not for Windows.

+      </para>
+    </section>
+    <section xml:id="std.io.fs.ops.resize" xreflabel="Resize File"><info><title>Resize 
File</title></info>
+      <para>
+       <function>filesystem::resize_file</function> uses POSIX
+       <function>truncate</function>.

... if available, in which case it's FS-Atomic.

Otherwise it opens the file to get a file descriptor then uses
ftruncate ... which is still FS-Atomic, I think?

+      </para>
+    </section>
+    <section xml:id="std.io.fs.ops.space" 
xreflabel="Space"><info><title>Space</title></info>
+      <para>
+       <function>filesystem::space</function> uses POSIX
+       <function>statvfs</function>.

... if available, in which case it's FS-Atomic.

For Windows it uses filesystem::absolute then GetDiskFreeSpaceExW
which means it depends on CWD.

+      </para>
+    </section>
+    <section xml:id="std.io.fs.ops.statknown" xreflabel="Status Known"><info><title>Status 
Known</title></info>
+    </section>
+      <para>
+       <function>filesystem::status_known</function> uses no POSIX operations.

Entirely specified in terms of other std::filesystem operations, so
doesn't need to be documented (as long as we have some front matter
saying we don't bother documenting such functions).

+      </para>
+    </section>
+    <section xml:id="std.io.fs.ops.tmpdir" xreflabel="Temporary Directory 
Path"><info><title>Temporary Directory Path</title></info>
+      <para>
+       <function>filesystem::temp_directory_path</function> uses the
+       GNU extension <function>secure_getenv</function> if available,
+       or else POSIX <function>getenv</function> to check the process
+       environment sequentially for a definition of
+       <code>"TMPDIR"</code>,
+       <code>"TMP</code>,
+       <code>"TEMP"</code>, and
+       <code>"TEMPDIR"</code>, in that order, immediately
+       returning a path constructed from the first match.
+       These calls access process global state, which may be changed
+       at any time by another thread.

e.g., by calls to setenv, unsetenv, or putenv.
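
For reference, the lookup order described in the patch text amounts to
something like the following sketch. The name tmp_from_env is
hypothetical, plain getenv is used instead of secure_getenv, and the
"/tmp" fallback is my assumption about the POSIX behaviour rather than
something stated in the patch:

```cpp
#include <cassert>
#include <stdlib.h> // POSIX header: getenv, and setenv/unsetenv for callers
#include <string>

// Hypothetical sketch of the environment lookup, first match wins.
std::string tmp_from_env()
{
  const char* const names[] = { "TMPDIR", "TMP", "TEMP", "TEMPDIR" };
  for (const char* name : names)
    if (const char* value = ::getenv(name))
      return value;
  return "/tmp"; // assumed fallback when none of the variables is set
}
```

Each getenv call races with a concurrent setenv, unsetenv or putenv in
another thread, which is a process-state race rather than an FS race.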

+      </para>
+    </section>
+  </section>
+</section>
</chapter>
--
2.53.0


