On Wed, 11 Feb 2026 09:22:06 -0500 Dan Cross <[email protected]> wrote:
> On Tue, Feb 10, 2026 at 10:34 PM Ori Bernstein <[email protected]> wrote:
> > On Tue, 10 Feb 2026 05:13:47 -0500
> > "Alyssa M via 9fans" <[email protected]> wrote:
> >
> > > On Monday, February 09, 2026, at 3:24 PM, ron minnich wrote:
> > > > as for mmap, there's already a defacto mmap happening for executables.
> > > > They are not read into memory. In fact, the first instruction you run
> > > > in a binary results in a page fault.
> > > I was thinking one could bring the same transparent/defacto memory
> > > mapping to read(2) and write(2), so the API need not change at all.
> >
> > That gets... interesting, from an FS semantics point of view.
> > What does this code print? Does it change with buffer sizes?
> >
> > 	fd = open("x", ORDWR);
> > 	pwrite(fd, "foo", 4, 0);
> > 	read(fd, buf, 4);
> > 	pwrite(fd, "bar", 4, 0);
> > 	print("%s\n", buf);
>
> It depends. Is `buf` some buffer on your stack or something similar
> (a global, static buffer, or heap-malloc'ed perhaps)? If so,
> presumably it still prints "foo", since the `read` would have copied
> the data out of any shared region and into process-private memory. Or,
> is it a pointer to the start of some region that you mapped to "x"?
> In that case, the whole program is suspect as it seems to operate well
> outside of the assumptions of C, but on Plan 9, I'd kind of expect it
> to print "bar".

In this example, no trickery; single-threaded code, nothing fancy.
Currently, it does what you suggest, but if you transparently mapped the
file when the alignment worked out, would that be guaranteed? What if
another process wrote to the file after the read? What if a different
machine did? Right now, eager copying works with no surprises, but lazy
mmapping adds a lot of things to get right.

> Perhaps a better example to illustrate the challenge Ron was referring
> to is to consider two processes, A and B: A opens a file for write, B
> opens a file and then maps it read-only and shared.
> The sequence of events is then that B dereferences a pointer into the
> region it mapped and reads the value there; then A seeks to that
> location, reads the value, updates it in some way (say, increments an
> integer or something), seeks back to the location in question and
> writes the new value. B then reads through that pointer a second time;
> what value does B see?

Here, the answer depends on the implementation. My point is that you
don't even need to get very fancy to run into semantic difficulties with
explicit mmap; you just need to realize that reads can be delayed by
arbitrary amounts of time, allowing anyone to come in and modify things
behind the program's back. I don't think you can get it right without
leaking a great deal of how the file has been mapped back into the file
server.

> Early Unix synchronized file state between memory and disk by
> channeling everything through the buffer cache; `write` actually
> copied into the buffer cache, and dirty buffers were copied out to the
> disk asynchronously; similarly, `read` copied data out of the cache:
> if a block was already in memory, it just copied it; but if the block
> needed to be read in from disk to fulfill the `read`, a buffer was
> allocated, the transfer from disk to buffer scheduled, and the calling
> process was suspended until the transfer completed, at which point the
> data was copied out of the buffer. For large reads and writes, this
> process could repeat many times and the details could get complex,
> especially as the number of buffers was fixed and relatively small
> (especially on a PDP-11). But, with a little tuning and some
> cooperation from programs, a lot of unnecessary IO could be avoided,
> and it dramatically simplified synchronizing between processes.
> Unix semantics said that reads and writes were "atomic" from the
> perspective of other users of the filesystem: a partial write in
> progress could not be observed by an outside reader, for example. This
> was done by putting locks on inodes during read/write, and buffers
> could be similarly locked while transferring from disk, and so on.
>
> Similarly, Unix has used "virtual memory" almost from the start, even
> on the PDP-11: the virtual address space was very small, but Unix
> processes were mapped into virtual segments that started at address 0;
> the kernel was mapped separately, and a region of memory shared
> between the two (the "user area") was mapped at the top of data space
> so that user processes could pass arguments into the kernel, and so
> on. But notably, this was used mainly for protection and simplifying
> user programs (which could use absolute addresses within their virtual
> address space); the system was still firmly swap-based: the address
> space was too small to usefully support demand paging, and it's not
> clear to me that the PDP-11 stored enough information when trapping to
> restart an instruction after a page fault. Even on the PDP-11, there
> was some sharing, though; cf. the "sticky bit" for the text of
> frequently-run executables.
>
> When Unix was moved to the VAX, the address space was greatly
> expanded, and demand paging was added in 3BSD and in Reiser's
> elaboration of the work started with 32/V inside Bell Labs. To
> accommodate sharing of demand-paged segments, a separate "page cache"
> was introduced, but this was essentially read-only, and was separate
> from the buffer cache used for IO with open/close/read/write etc.
> Critically, entries in the page cache were page-sized, while entries
> in the buffer cache were block-sized, and the two weren't necessarily
> the same.
> With larger address spaces, programmers wanted to use shared memory as
> an alternative to traditional forms of IPC, like pipes, and the `mmap`
> _design_, based on the `PMAP` call on TENEX/TOPS-20, was presented
> with 4.2BSD, but not implemented. Sun did an independent
> implementation for SunOS, as did many of the commercial Unix vendors;
> Berkeley eventually did an implementation in 4.4BSD (btw, there was
> some talk that Sun would donate their VM system to Berkeley for 4.4,
> but those talks fell through, so CSRG adopted the Mach VM instead). By
> then, Linux had done their own as well. System V had its own shared
> memory stuff that was different, but the industry writ large more or
> less settled on mmap by the mid- to late-1990s. `mmap` gave programs a
> lot more control over the address space of the process they run in,
> which led to things like supporting shared libraries and so on. The
> Sun paper on doing that in SunOS 4 is pretty interesting, btw; shared
> library support is basically implemented outside of the kernel, with a
> little cooperation from the C startup code in libc:
> http://mcvoy.com/lm/papers/SunOS.shlib.pdf
>
> But `mmap` worked in terms of the VM page cache, and with writable
> mappings, the page cache necessarily became read/write. But
> open/close/read/write continued to use the buffer cache, which was
> separate, which meant that a region of any given file might be present
> in both caches simultaneously; with no interlock between the two, a
> write to one wouldn't necessarily update the other. If programs were
> using both for IO on a file simultaneously, the caches could easily
> fall out of sync and the state of the file on disk would be
> indeterminate. Unifying the two caches addressed this, and SunOS did
> that in the 1980s, but it took a long time to get right and ultimately
> wasn't done for everything (directories remained in the buffer cache
> but not the page cache).
> Something that aggravates some old-time Sun engineers is that when ZFS
> was implemented in Solaris, it used its own cache (the ARC) that
> wasn't synchronized with the page cache, undoing much of the earlier
> work. The ZFS implementors mostly don't care, and note that using
> `mmap` for output is fraught: https://db.cs.cmu.edu/mmap-cidr2022/
>
> When Plan 9 came along, the entire architecture changed. Sure,
> executables are faulted in on demand and things like `segattach` exist
> to support memory-mapped IO devices and the like, but there's no real
> equivalent to the full generality of `mmap`, and no shared libraries.
> There are legitimate questions about synchronization when mapping a
> file served by a remote server somewhere, and error handling is a
> challenge: how does one detect a write failure via a store into a
> mapped region? Presumably, that's reflected into an exception that's
> delivered to the process in the form of a note or something. (If you
> really want to twist your noggin around, look at how Multics did it.
> Interestingly, the semantics of deleting a segment on Multics are
> closer to how Plan 9 deals with deleting a currently open file than to
> unlink'ing a name on Unix, though of course that's dependent on the
> backing file server.) But if error handling for a _store_ involves a
> round trip to/from some distant machine, that can be painful.
>
> `mmap` is a lot easier to get right on a single machine with local
> storage and no networking at play, and yes, Sun did it for NFS (to
> support running dynamically linked executables from NFS-mounted
> filesystems), but it's a pain and the semantics evolved informally,
> instead of being well-defined from the start.
>
> Bottom line: it's not trivial. Fortunately, if someone wants to make a
> go of it, there's close to 50 years' worth of prior art to learn from.
>
> - Dan C.
--
Ori Bernstein <[email protected]>

------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-Mecd608135b0c0e70a20c7040
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
