On Tue, Feb 10, 2026 at 10:34 PM Ori Bernstein <[email protected]> wrote:
> On Tue, 10 Feb 2026 05:13:47 -0500
> "Alyssa M via 9fans" <[email protected]> wrote:
>
> > On Monday, February 09, 2026, at 3:24 PM, ron minnich wrote:
> > > as for mmap, there's already a defacto mmap happening for executables.
> > > They are not read into memory. In fact, the first instruction you run in
> > > a binary results in a page fault.
> >
> > I thinking one could bring the same transparent/defacto memory mapping to
> > read(2) and write(2), so the API need not change at all.
>
> That gets... interesting, from an FS semantics point of view.
> What does this code print? Does it change with buffer sizes?
>
>     fd = open("x", ORDWR);
>     pwrite(fd, "foo", 4, 0);
>     read(fd, buf, 4);
>     pwrite(fd, "bar", 4, 0);
>     print("%s\n", buf);
It depends. Is `buf` a buffer on your stack or something similar (a global, a static buffer, or perhaps heap-malloc'ed)? If so, presumably it still prints "foo", since the `read` would have copied the data out of any shared region and into process-private memory. Or is it a pointer into a region that you mapped to "x"? In that case the whole program is suspect, since it operates well outside the assumptions of C, but on Plan 9 I'd kind of expect it to print "bar".

Perhaps a better example to illustrate the challenge Ron was referring to is to consider two processes, A and B: A opens a file for write; B opens the same file and then maps it read-only and shared. The sequence of events is then that B dereferences a pointer into the region it mapped and reads the value there; then A seeks to that location, reads the value, updates it in some way (say, increments an integer or something), seeks back to the location in question, and writes the new value. B then reads through that pointer a second time; what value does B see?

Here, the answer depends on the implementation. Early Unix synchronized file state between memory and disk by channeling everything through the buffer cache: `write` actually copied into the buffer cache, and dirty buffers were copied out to the disk asynchronously; similarly, `read` copied data out of the cache. If a block was already in memory, the data was simply copied out; but if the block needed to be read in from disk to fulfill the `read`, a buffer was allocated, the transfer from disk to buffer scheduled, and the calling process suspended until the transfer completed, at which point the data was copied out of the buffer. For large reads and writes this process could repeat many times, and the details could get complex, especially as the number of buffers was fixed and relatively small (especially on a PDP-11).
But with a little tuning and some cooperation from programs, a lot of unnecessary IO could be avoided, and it dramatically simplified synchronizing between processes. Unix semantics said that reads and writes were "atomic" from the perspective of other users of the filesystem: a partial write in progress could not be observed by an outside reader, for example. This was done by putting locks on inodes during read/write; buffers could be similarly locked while transferring from disk, and so on.

Similarly, Unix used "virtual memory" almost from the start, even on the PDP-11: the virtual address space was very small, but Unix processes were mapped into virtual segments that started at address 0. The kernel was mapped separately, and a region of memory shared between the two (the "user area") was mapped at the top of data space so that user processes could pass arguments into the kernel, and so on. But notably, this was used mainly for protection and for simplifying user programs (which could use absolute addresses within their virtual address space); the system was still firmly swap-based: the address space was too small to usefully support demand paging, and it's not clear to me that the PDP-11 stored enough information when trapping to restart an instruction after a page fault. Even on the PDP-11 there was some sharing, though; cf. the "sticky bit" for the text of frequently-run executables.

When Unix was moved to the VAX, the address space was greatly expanded, and demand paging was added in 3BSD and in Reiser's elaboration of the work started with 32/V inside Bell Labs. To accommodate sharing of demand-paged segments, a separate "page cache" was introduced, but it was essentially read-only, and was separate from the buffer cache used for IO with open/close/read/write etc. Critically, entries in the page cache were page-sized, while entries in the buffer cache were block-sized, and the two weren't necessarily the same.
With larger address spaces, programmers wanted to use shared memory as an alternative to traditional forms of IPC, like pipes, and the `mmap` _design_, based on the `PMAP` call on TENEX/TOPS-20, was presented with 4.2BSD, but not implemented. Sun did an independent implementation for SunOS, as did many of the commercial Unix vendors; Berkeley eventually did an implementation in 4.4BSD (btw, there was some talk that Sun would donate their VM system to Berkeley for 4.4, but those talks fell through, so CSRG adopted the Mach VM instead). By then, Linux had done their own as well. System V had its own, different, shared memory mechanism, but the industry writ large more or less settled on mmap by the mid- to late 1990s.

`mmap` gave programs a lot more control over the address space of the process they run in, which led to things like support for shared libraries and so on. The Sun paper on doing that in SunOS 4 is pretty interesting, btw; shared library support is basically implemented outside of the kernel, with a little cooperation from the C startup code in libc: http://mcvoy.com/lm/papers/SunOS.shlib.pdf

But `mmap` worked in terms of the VM page cache, and with writable mappings the page cache necessarily became read/write. Meanwhile open/close/read/write continued to use the buffer cache, which was separate; a region of any given file might be present in both caches simultaneously, and with no interlock between the two, a write to one wouldn't necessarily update the other. If programs were using both for IO on a file simultaneously, the caches could easily fall out of sync and the state of the file on disk would be indeterminate. Unifying the two caches addressed this, and SunOS did that in the 1980s, but it took a long time to get right and ultimately wasn't done for everything (directories remained in the buffer cache but not the page cache).
Something that aggravates some old-time Sun engineers is that when ZFS was implemented in Solaris, it used its own cache (the ARC) that wasn't synchronized with the page cache, undoing much of the earlier work. The ZFS implementors mostly don't care, and note that using `mmap` for output is fraught: https://db.cs.cmu.edu/mmap-cidr2022/

When Plan 9 came along, the entire architecture changed. Sure, executables are faulted in on demand, and things like `segattach` exist to support memory-mapped IO devices and the like, but there's no real equivalent to the full generality of `mmap`, and no shared libraries. There are legitimate questions about synchronization when mapping a file served by a remote server somewhere, and error handling is a challenge: how does one detect a write failure via a store into a mapped region? Presumably, that's reflected into an exception that's delivered to the process in the form of a note or something. (If you really want to twist your noggin around, look at how Multics did it. Interestingly, the semantics of deleting a segment on Multics are closer to how Plan 9 deals with deleting a currently open file than to unlink'ing a name on Unix, though of course that's dependent on the backing file server.) But if error handling for a _store_ involves a round-trip to/from some distant machine, that can be painful.

`mmap` is a lot easier to get right on a single machine with local storage and no networking at play. And yes, Sun did it for NFS (to support running dynamically linked executables from NFS-mounted filesystems), but it's a pain, and the semantics evolved informally instead of being well-defined from the start.

Bottom line: it's not trivial. Fortunately, if someone wants to make a go of it, there's close to 50 years worth of prior art to learn from.

        - Dan C.
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-Me8f735d3c62aac435db5b793
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
