On Wed, Feb 11, 2026 at 1:44 PM Ori Bernstein <[email protected]> wrote:
> On Wed, 11 Feb 2026 09:22:06 -0500 Dan Cross <[email protected]> wrote:
> > On Tue, Feb 10, 2026 at 10:34 PM Ori Bernstein <[email protected]> wrote:
> > > On Tue, 10 Feb 2026 05:13:47 -0500
> > > "Alyssa M via 9fans" <[email protected]> wrote:
> > >
> > > > On Monday, February 09, 2026, at 3:24 PM, ron minnich wrote:
> > > > > as for mmap, there's already a de facto mmap happening for
> > > > > executables. They are not read into memory. In fact, the first
> > > > > instruction you run in a binary results in a page fault.
> > > > I think one could bring the same transparent/de facto memory mapping
> > > > to read(2) and write(2), so the API need not change at all.
> > >
> > > That gets... interesting, from an FS semantics point of view.
> > > What does this code print? Does it change with buffer sizes?
> > >
> > > fd = open("x", ORDWR);
> > > pwrite(fd, "foo", 4, 0);
> > > read(fd, buf, 4);
> > > pwrite(fd, "bar", 4, 0);
> > > print("%s\n", buf);
> >
> > It depends. Is `buf` some buffer on your stack or something similar
> > (a global, static buffer, or heap-malloc'ed perhaps)? If so,
> > presumably it still prints "foo", since the `read` would have copied
> > the data out of any shared region and into process-private memory. Or,
> > is it a pointer to the start of some region that you mapped to "x"?
> > In that case, the whole program is suspect as it seems to operate well
> > outside of the assumptions of C, but on Plan 9, I'd kind of expect it
> > to print "bar".
>
> In this example, no trickery; single threaded code, nothing fancy.
Ok. Perhaps implicitly you also mean that there's no `mmap` involved?
> Currently, it does what you suggest, but if you did transparent
> mapping of the file when the alignment worked, would that be guaranteed?
I don't think that changes my earlier conclusion, which is that it
depends entirely on where `buf` is.
> What if another process wrote to the file after the read? What if
> a different machine did?
All of these things are issues, but I don't see how they change your
example, which I don't think very convincingly illustrates the
problems specific to mmap. Restating it:
1. You write a few bytes to some file referred to by `fd` at offset 0,
2. You then read from offset 0 in that same file into some buffer called `buf`,
3. You then write some bytes to that same file, again at offset 0,
4. Finally, you print the contents of `buf` and ask what it displays.
As before, the answer is entirely "it depends." There are too many
unknowns; critically, you didn't say where `buf` lives, and that's
really the thing that matters vis-a-vis `mmap` in this example.
> Right now, eagerly copying works with no surprises, but lazily
> mmapping adds a lot of things to get right.
I don't see how it's appreciably different from the program's
perspective, unless `buf` is somehow aliasing the beginning of a
region that was `mmap`'ed to the file you are mutating via `pwrite`.
But that's weird, and all bets are kind of off in that case; I view
it as not quite as bad as, but in the same villainous league as,
opening /proc/$pid/mem and grubbing around.
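To make that weird case concrete (a sketch only; the `mmap` call is
hypothetical, since Plan 9 doesn't have one):

char *buf = mmap(fd, ...);	/* buf aliases byte 0 of "x" */
pwrite(fd, "foo", 4, 0);
read(fd, buf, 4);		/* copies the file's bytes onto themselves */
pwrite(fd, "bar", 4, 0);	/* stores land in the memory buf points at */
print("%s\n", buf);		/* plausibly "bar" */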
If `buf` is in your proc-local stack segment, however, then I don't
expect anything to change it once read, regardless of what you do to
the file, whether mmapped or manipulated via explicit write calls.
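Concretely (same calls, but `buf` is private memory):

char buf[4];			/* proc-local stack storage */
pwrite(fd, "foo", 4, 0);
read(fd, buf, 4);		/* read copies the bytes into buf */
pwrite(fd, "bar", 4, 0);	/* changes the file, not buf */
print("%s\n", buf);		/* "foo" */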
With `read`, the contents may not be the same as what you wrote, but
that's true today: did another process open the file and write to it
after your first `pwrite` but before your `read`? Maybe, depending on
the file and the way it was opened (does it have the `l` bit set?). If
so, it could print anything; it may print "baz" for all we know. For
that matter, you don't even know that `buf` is nul-terminated when you
pass it to `print("%s\n", buf)`. Are you racing against another proc
you yourself `rfork`'ed? Maybe, but you said no, so I'll assume not.
Anyway, my point wasn't that there aren't issues; there demonstrably
are. But I think that if you want to provide a good example for the
dangers of `mmap` here, something like this is better:
int fd = open("something", ORDWR);
char *p = mmap(fd, RD|WR, len, etc);	/* hypothetical mmap again */
char a[4], b[4];
pwrite(fd, "foo", 4, 0);
pread(fd, a, 4, 0);
memmove(p, "bar", 4);	/* store through the mapping */
pread(fd, b, 4, 0);
if (memcmp(a, b, 4) == 0)
	print("same same\n");
else
	print("but different\n");
Let's assume the happy path where this isn't racing against anything
else mutating "something", nothing crashes, or anything of that
nature: in that case, I assert that it could print either string, but
that `a` unconditionally contains { 'f', 'o', 'o', '\0' } while the
contents of `b` might be { 'f', 'o', 'o', '\0' }, or they might be
{ 'b', 'a', 'r', '\0' }, or some combination, all depending on the
semantics of `mmap` and how it's implemented, and, perhaps
surprisingly, on `memmove` and its implementation and the particular
properties of where `a` and `b` wind up in memory.
> > Perhaps a better example to illustrate the challenge Ron was referring
> > to is to consider two processes, A and B: A opens a file for write, B
> > opens a file and then maps it read-only and shared. The sequence of
> > events is then that, B dereferences a pointer into the region it
> > mapped and reads the value there; then A seeks to that location, reads
> > the value, updates it in some way (say, increments an integer or
> > something), seeks back to the location in question and writes the new
> > value. B then reads through that pointer a second time; what value
> > does B see? Here, the answer depends on the implementation.
>
> My point is that you don't even need to get very fancy to get
> semantic difficulties with explicit mmap, you just need to realize
> that reads can get delayed by arbitrary amounts of time, allowing
> anyone to come in and modify things behind the program's back.
Yes, but that's true today with read and write; I don't see what that
has to do with mmap specifically. Also, I think I said as much in the
remainder of my response.
> I don't think you can get it right without leaking a great deal
> of how the file has been mapped back into the file server.
It depends on how you define your semantics.
If you declare that writes are coherent with reads, then yeah, you
need some sort of callback mechanism to let a reader know that their
cached copy is now stale, which means you need to track a lot of state
on the server, and presumably you need to set things up to fault on
writes so that you can reflect each store back to the server when a
program does one; that is going to be bad even for file-backed
regions, where every store in a `memmove` is a fault and a round-trip,
and it's really going to suck for anonymous memory (...so don't do
that for anon mem, which by definition isn't going to some server
somewhere anyway). But maybe you rely on lending out leases with a
known expiration date to clients; if a write faults, the client has to
go to the server, get a lease, remap the region writable, restart the
operation, and then schedule a callback to revoke write access to the
region and flush its contents back up to the server when the lease
expires (or right before that, maybe).
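To sketch that lease path (every name here is made up for
illustration; nothing like this exists in any Plan 9 kernel today):

/* client-side handling of a fault on a store into a leased mapping */
void
wfault(Map *m, uintptr va)
{
	Lease l;

	l = getlease(m->srv, m->path);	/* round-trip to the server */
	remapwritable(m, va, l.len);	/* the faulting store can now restart and complete */
	atexpiry(l.expiry, flushandrevoke, m);	/* write back and drop access before the lease lapses */
}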
If you declare instead that all bets are off and you can't write-map
with sharing, so stores won't be reflected back to the server, and
it's up to programs to synchronize writes themselves using some
external means, then it's not all that different from
open/close/read/write today in the case where you open a file, read a
copy locally, and work from that copy. If someone else writes to that
file while you're memmove'ing out of the shared region, you could see
inconsistent data, but you've already said that all bets are off,
so...oh well. Besides, someone can be writing some largish amount of
data to that same file in a loop using plain ol' `write` a chunk at a
time, and you could see the result of one of those writes before the
whole thing is done.
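For instance (ordinary write calls, no mapping anywhere):

/* writer: pushes the data out a chunk at a time
   (assume total is a multiple of CHUNK) */
for(off = 0; off < total; off += CHUNK)
	pwrite(fd, data+off, CHUNK, off);
/* a concurrent reader can observe the file between any two of
   these pwrites, i.e. in any partially-updated state */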
Anyway. Yeah, it's a hard problem, but all of these techniques have
been implemented before, on a variety of different systems, with
greater or lesser degrees of success. It's all there in the literature
to see what has worked on systems in the past, and what has not.
- Dan C.
> > Early Unix synchronized file state between memory and disk by
> > channeling everything through the buffer cache; `write` actually
> > copied into the buffer cache, and dirty buffers were copied out to the
> > disk asynchronously; similarly, `read` copied data out of the cache,
> > and if a block was already in memory, it just copied it; but if the
> > block needed to be read in from disk to fulfill the `read`, a buffer
> > was allocated, the transfer from disk to buffer scheduled, and the
> > calling process was suspended until the transfer completed, at which
> > point the data was copied out of the buffer. For large reads and
> > writes, this process could repeat many times and the details could get
> > complex, especially as the number of buffers was fixed and relatively
> > small (especially on a PDP-11). But, with a little tuning and some
> > cooperation from programs, a lot of unnecessary IO could be avoided,
> > and it dramatically simplified synchronizing between processes. Unix
> > semantics said that reads and writes were "atomic" from the
> > perspective of other users of the filesystem: a partial write in
> > progress could not be observed by an outside reader, for example: this
> > was done by putting locks on inodes during read/write, and buffers
> > could be similarly locked while transferring from disk, and so on.
> >
> > Similarly, Unix has used "virtual memory" almost from the start, even
> > on the PDP-11: the virtual address space was very small, but Unix
> > processes were mapped into virtual segments that started at address 0;
> > the kernel was mapped separately, and a region of memory shared
> > between the two (the "user area") was mapped at the top of data space
> > so that user processes could pass arguments into the kernel, and so
> > on. But notably, this was used mainly for protection and simplifying
> > user programs (which could use absolute addresses within their virtual
> > address space); the system was still firmly swap based: the address
> > space was too small to usefully support demand paging, and it's not
> > clear to me that the PDP-11 stored enough information when trapping to
> > restart an instruction after a page fault. Even on the PDP-11, there
> > was some sharing, though; cf the "sticky bit" for the text of
> > frequently-run executables.
> >
> > When Unix was moved to the VAX, the address space was greatly
> > expanded, and demand paging was added in 3BSD and in Reiser's
> > elaboration of the work started with 32/V inside Bell Labs. To
> > accommodate sharing of demand paged segments, a separate "page cache"
> > was introduced, but this was essentially read-only, and was separate
> > from the buffer cache, which was used for IO with open/close/read/write etc.
> > Critically, entries in the page cache were page sized, while entries
> > in the buffer cache were block sized, and the two weren't necessarily
> > the same.
> >
> > With larger address spaces, programmers wanted to use shared memory as
> > an alternative to traditional forms of IPC, like pipes, and the `mmap`
> > _design_, based on the `PMAP` call on TENEX/TOPS-20, was presented
> > with 4.2BSD, but not implemented. Sun did an independent
> > implementation for SunOS as did many of the commercial Unix vendors;
> > Berkeley eventually did an implementation in 4.4BSD (btw, there was
> > some talk that Sun would donate their VM system to Berkeley for 4.4,
> > but those talks fell through, so CSRG adopted the Mach VM, instead).
> > By then, Linux had done their own as well. System V had its own shared
> > memory stuff that was different, but the industry writ large more or
> > less settled on mmap by the mid- to late-1990s. `mmap` gave
> > programs a lot more control over the address space of the process they
> > run in, which led to things like supporting shared libraries and so
> > on. The Sun paper on doing that in SunOS4 is pretty interesting, btw;
> > shared library support is basically implemented outside of the kernel,
> > with a little cooperation from the C startup code in libc:
> > http://mcvoy.com/lm/papers/SunOS.shlib.pdf
> >
> > But `mmap` worked in terms of the VM page cache, and with writable
> > mappings, the page cache necessarily became read/write. But
> > open/close/read/write continued to use the buffer cache, which was
> > separate, which meant that a region of any given file might be present
> > in both caches simultaneously, and with no interlock between the two,
> > a write to one wouldn't necessarily update the other. If programs
> > were using both for IO on a file simultaneously, the caches could
> > easily fall out of sync and the state of the file on disk would be
> > indeterminate. Unifying the two caches addressed this, and SunOS did
> > that in the 1980s, though it took a long time to get right and
> > ultimately wasn't done for everything (directories remained in the
> > buffer cache but not the page cache). Something that aggravates some
> > old-time Sun engineers is that when ZFS was implemented in Solaris, it
> > used its own cache (the ARC) that wasn't synchronized with the page
> > cache, undoing much of the earlier work. The ZFS implementors mostly
> > don't care, and note that using `mmap` for output is fraught:
> > https://db.cs.cmu.edu/mmap-cidr2022/
> >
> > When Plan 9 came along, the entire architecture changed. Sure,
> > executables are faulted in on demand and things like `segattach` exist
> > to support memory-mapped IO devices and the like, but there's no real
> > equivalent to the full generality of `mmap`, and no shared libraries.
> > There are legitimate questions about synchronization when mapping a
> > file served by a remote server somewhere, and error handling is a
> > challenge: how does one detect a write failure via a store into a
> > mapped region? Presumably, that's reflected into an exception that's
> > delivered to the process in the form of a note or something (if you
> > really want to twist your noggin around, look at how Multics did it.
> > Interestingly, the semantics of deleting a segment on Multics are
> > closer to how Plan 9 deals with deleting a currently open file than
> > unlink'ing a name on Unix, though of course that's dependent on the
> > backing file server). But if error handling for a _store_ involves a
> > round-trip to/from some distant machine, that can be painful.
> >
> > `mmap` is a lot easier to get right on a single machine with local
> > storage and no networking at play; and yes, Sun did it for NFS (to
> > support running dynamically linked executables from NFS-mounted
> > filesystems), but it's a pain and the semantics evolved informally,
> > instead of being well-defined from the start.
> >
> > Bottom line: it's not trivial. Fortunately, if someone wants to make a
> > go of it, there's close to 50 years worth of prior art to learn from.