On Tue, Feb 10, 2026 at 10:34 PM Ori Bernstein <[email protected]> wrote:
> On Tue, 10 Feb 2026 05:13:47 -0500
> "Alyssa M via 9fans" <[email protected]> wrote:
>
> > On Monday, February 09, 2026, at 3:24 PM, ron minnich wrote:
> > > as for mmap, there's already a defacto mmap happening for executables. 
> > > They are not read into memory. In fact, the first instruction you run in 
> > > a binary results in a page fault.
> > I thinking one could bring the same transparent/defacto memory mapping to 
> > read(2) and write(2), so the API need not change at all.
>
> That gets... interesting, from an FS semantics point of view.
> What does this code print? Does it change with buffer sizes?
>
>         fd = open("x", ORDWR);
>         pwrite(fd, "foo", 4, 0);
>         read(fd, buf, 4);
>         pwrite(fd, "bar", 4, 0);
>         print("%s\n", buf);

It depends.  Is `buf` some buffer on your stack or something similar
(a global, static buffer, or heap-malloc'ed perhaps)? If so,
presumably it still prints "foo", since the `read` would have copied
the data out of any shared region and into process-private memory. Or,
is it a pointer to the start of some region that you mapped to "x"?
In that case, the whole program is suspect as it seems to operate well
outside of the assumptions of C, but on Plan 9, I'd kind of expect it
to print "bar".

Perhaps a better example to illustrate the challenge Ron was referring
to is to consider two processes, A and B: A opens a file for write, B
opens a file and then maps it read-only and shared.  The sequence of
events is then that B dereferences a pointer into the region it
mapped and reads the value there; then A seeks to that location, reads
the value, updates it in some way (say, increments an integer or
something), seeks back to the location in question and writes the new
value. B then reads through that pointer a second time; what value
does B see?  Here, the answer depends on the implementation.

Early Unix synchronized file state between memory and disk by
channeling everything through the buffer cache; `write` actually
copied into the buffer cache, and dirty buffers were copied out to the
disk asynchronously; similarly, `read` copied data out of the cache,
and if a block was already in memory, it just copied it; but if the
block needed to be read in from disk to fulfill the `read`, a buffer
was allocated, the transfer from disk to buffer scheduled, and the
calling process was suspended until the transfer completed, at which
point the data was copied out of the buffer. For large reads and
writes, this process could repeat many times and the details could get
complex, especially as the number of buffers was fixed and relatively
small (particularly on a PDP-11). But, with a little tuning and some
cooperation from programs, a lot of unnecessary IO could be avoided,
and it dramatically simplified synchronizing between processes. Unix
semantics said that reads and writes were "atomic" from the
perspective of other users of the filesystem: a partial write in
progress could not be observed by an outside reader, for example. This
was done by putting locks on inodes during read/write, and buffers
could be similarly locked while transferring from disk, and so on.

Similarly, Unix has used "virtual memory" almost from the start, even
on the PDP-11: the virtual address space was very small, but Unix
processes were mapped into virtual segments that started at address 0;
the kernel was mapped separately, and a region of memory shared
between the two (the "user area") was mapped at the top of data space
so that user processes could pass arguments into the kernel, and so
on. But notably, this was used mainly for protection and simplifying
user programs (which could use absolute addresses within their virtual
address space); the system was still firmly swap based: the address
space was too small to usefully support demand paging, and it's not
clear to me that the PDP-11 stored enough information when trapping to
restart an instruction after a page fault. Even on the PDP-11, there
was some sharing, though; cf. the "sticky bit" for the text of
frequently-run executables.

When Unix was moved to the VAX, the address space was greatly
expanded, and demand paging was added in 3BSD and in Reiser's
elaboration of the work started with 32/V inside Bell Labs. To
accommodate sharing of demand paged segments, a separate "page cache"
was introduced, but this was essentially read-only, and was separate
from the buffer cache used for IO with open/close/read/write, etc.
Critically, entries in the page cache were page sized, while entries
in the buffer cache were block sized, and the two weren't necessarily
the same.

With larger address spaces, programmers wanted to use shared memory as
an alternative to traditional forms of IPC, like pipes, and the `mmap`
_design_, based on the `PMAP` call on TENEX/TOPS-20, was presented
with 4.2BSD, but not implemented. Sun did an independent
implementation for SunOS as did many of the commercial Unix vendors;
Berkeley eventually did an implementation in 4.4BSD (btw, there was
some talk that Sun would donate their VM system to Berkeley for 4.4,
but those talks fell through, so CSRG adopted the Mach VM, instead).
By then, Linux had done its own as well. System V had its own shared
memory stuff that was different, but the industry writ large more or
less settled on mmap by the mid- to late 1990s.  `mmap` gave
programs a lot more control over the address space of the process they
run in, which led to things like supporting shared libraries and so
on.  The Sun paper on doing that in SunOS4 is pretty interesting, btw;
shared library support is basically implemented outside of the kernel,
with a little cooperation from the C startup code in libc:
http://mcvoy.com/lm/papers/SunOS.shlib.pdf

But `mmap` worked in terms of the VM page cache, and with writable
mappings, the page cache necessarily became read/write. But
open/close/read/write continued to use the separate buffer cache,
which meant that a region of any given file might be present
in both caches simultaneously, and with no interlock between the two,
a write to one wouldn't necessarily update the other.  If programs
were using both for IO on a file simultaneously, the caches could
easily fall out of sync and the state of the file on disk would be
indeterminate. Unifying the two caches addressed this; SunOS did
that in the 1980s, but it took a long time to get right and
ultimately wasn't done for everything (directories remained in the
buffer cache but not the page cache). Something that aggravates some
old-time Sun engineers is that when ZFS was implemented in Solaris, it
used its own cache (the ARC) that wasn't synchronized with the page
cache, undoing much of the earlier work.  The ZFS implementors mostly
don't care, and note that using `mmap` for output is fraught:
https://db.cs.cmu.edu/mmap-cidr2022/

When Plan 9 came along, the entire architecture changed. Sure,
executables are faulted in on demand and things like `segattach` exist
to support memory-mapped IO devices and the like, but there's no real
equivalent to the full generality of `mmap`, and no shared libraries.
There are legitimate questions about synchronization when mapping a
file served by a remote server somewhere, and error handling is a
challenge: how does one detect a write failure via a store into a
mapped region? Presumably, that's reflected into an exception that's
delivered to the process in the form of a note or something (if you
really want to twist your noggin around, look at how Multics did it.
Interestingly, the semantics of deleting a segment on Multics are
closer to how Plan 9 deals with deleting a currently open file than
unlink'ing a name on Unix, though of course that's dependent on the
backing file server). But if error handling for a _store_ involves a
round-trip to/from some distant machine, that can be painful.

`mmap` is a lot easier to get right on a single machine with local
storage and no networking at play; and yes, Sun did it for NFS (to
support running dynamically linked executables from NFS-mounted
filesystems), but it's a pain and the semantics evolved informally,
instead of being well-defined from the start.

Bottom line: it's not trivial. Fortunately, if someone wants to make a
go of it, there's close to 50 years' worth of prior art to learn from.

        - Dan C.

------------------------------------------
9fans: 9fans
Permalink: 
https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-Me8f735d3c62aac435db5b793
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
