The difficulty here is that having read mark a region as paged in
"later" delays the actual I/O; by the time that I/O happens, the file
contents may have changed, and your read returns incorrect results.
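A minimal userspace sketch of that failure mode (all names invented for
illustration; the "file" is just a byte array, and the deferred read
records its source and copies on first touch):

```c
#include <assert.h>
#include <string.h>

/* Illustrative only: model a file as a byte array and a deferred read
 * as "remember where to copy from, copy at first access". If the file
 * changes between read() returning and the buffer being touched, the
 * reader sees the new contents - the incorrect result described above. */

static char file[8] = "oldtext";

struct deferred_read {
	const char *src;	/* where the data will be fetched from, later */
	char buf[8];
	int materialized;
};

static void deferred_read_start(struct deferred_read *r)
{
	r->src = file;		/* no I/O yet: just note the source */
	r->materialized = 0;
}

static const char *deferred_read_touch(struct deferred_read *r)
{
	if (!r->materialized) {	/* the delayed I/O happens here */
		memcpy(r->buf, r->src, sizeof r->buf);
		r->materialized = 1;
	}
	return r->buf;
}
```

If the file is mutated after the read "completes" but before the buffer
is touched, the materialized bytes are the mutated ones, not the bytes
that existed at read time - which is why a snapshot at read() time is
needed to preserve the semantics.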
This idea can work if your OS has a page cache, the data is already in
the page cache, and you eagerly read the data that is not loaded -- but
the delayed I/O semantics otherwise simply break. Fixing this would
need deep filesystem-level help: the filesystem would need to take a
snapshot when the read is invoked, to prevent any subsequent mutations
from being visible to the reader. (On most Plan 9 file systems, this
per-file snapshot is fairly expensive; on gefs, for example, it would
snapshot all files within the mount.)

On Sun, 15 Feb 2026 21:24:32 -0500
"Alyssa M via 9fans" <[email protected]> wrote:

> I think the difficulty here is thinking about this as memory mapping.
> What I'm really doing is deferred I/O. By the time a read completes,
> the read has logically happened; it's just that not all of the data
> has been transferred yet. That happens later, as the buffer is
> examined, and if pages of the buffer are not examined, it doesn't
> happen in those pages at all.
>
> My implementation (on my hobby OS) only does this in a custom segment
> type. A segment of this type can be of any size, but no pages are
> pre-allocated for it in memory or the swap file - I do this to allow
> it to be very large, and because a read has to happen within the
> boundaries of a segment. I back it with a file system temporary file,
> so when pages migrate to the swap area the disk allocation can be
> sparse. You can load or store bytes anywhere in this segment.
> Touching pages allocates them, first in memory and eventually in the
> swap file as they get paged out.
>
> On Saturday, February 14, 2026, at 2:27 PM, Dan Cross wrote:
>
> > but read/write work in terms of byte buffers that have no
> > obligation to be byte aligned.
> > Put another way, read and write relate the contents of a "file"
> > with an arbitrarily sized and aligned byte-buffer in memory, but
> > there is no obligation that those byte buffers have the properties
> > required to be a "page" in the virtual memory sense.
>
> Understood. My current implementation does conventional I/O with any
> fragments of pages at the beginning and end of the read/write
> buffers, so small reads and writes happen traditionally. At the
> moment that's done before the read completes, so your example of
> doing lots of adjacent reads of small areas would work very badly
> (few pages would get the deferred loading), but I think I can do
> better by deferring the fragment I/O, so adjacent reads can coalesce
> the snapshots. My main scenario of interest, though, is very large
> reads and writes, because that's where the sparse access has value.
>
> Because reads are copies and not memory mapping, it doesn't matter if
> the reads are not page-aligned. The process's memory pages are not
> being shared with the cache of the file (snapshot), so if the data is
> not aligned then page faults will copy bytes from two cached file
> blocks (assuming they're the same size). In practice I'm expecting
> that large reads will be into large allocations, which will be
> aligned, so there's an opportunity to steal blocks from the file
> cache - but I'm not expecting to implement this. There's no coherence
> problem here, because the snapshot is private to the process. And
> read-only.
>
> When I do a read call into the segment, first a snapshot is made of
> the data to be read. This is functionally equivalent to making a
> temporary file and copying the data into it. Making this
> copy-on-write, so the snapshot costs nothing, is a key part of this;
> without it there would be no point. The pages of the read buffer in
> the segment are then associated with parts of the snapshot - rather
> than the swap file.
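The per-page bookkeeping this implies can be sketched roughly as
follows (all names and types hypothetical, invented for illustration;
pages are shrunk, and the snapshot is modelled as a plain byte slice
rather than a copy-on-write file):

```c
#include <assert.h>
#include <string.h>

/* Hypothetical sketch: each page of the segment is either backed by a
 * read snapshot (clean; filled from the snapshot on fault) or by the
 * swap file (dirty, or never part of a deferred read). */

#define PGSIZE 16	/* tiny pages for the demo */

enum backing { B_SNAPSHOT, B_SWAP };

struct page {
	enum backing backing;
	int present;		/* mapped in memory yet? */
	char mem[PGSIZE];
	const char *snap;	/* this page's slice of the snapshot */
};

/* Fault on a load: fill from the snapshot instead of zero-filling or
 * reloading paged-out data. */
static void fault_load(struct page *p)
{
	if (p->present)
		return;
	if (p->backing == B_SNAPSHOT)
		memcpy(p->mem, p->snap, PGSIZE);
	else
		memset(p->mem, 0, PGSIZE);	/* demand-zero */
	p->present = 1;
}

/* A store dirties the page and severs its association with the
 * snapshot; from then on it is backed by the swap file. */
static void fault_store(struct page *p, int off, char c)
{
	fault_load(p);		/* bring the old contents in first */
	p->mem[off] = c;
	p->backing = B_SWAP;
	p->snap = 0;
}
```

A real implementation would hang this state off the segment's page
tables; the point is only that a clean page carries a reference into
the snapshot, and the first store severs it.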
> So rather than zero-filling (or reloading paged-out data) when a load
> instruction is executed, the memory pages are filled from the
> snapshot. When a store instruction happens, the page becomes dirty
> and loses its association with the snapshot; it's then backed by the
> swap file. If you alter all pages of the buffer, then all pages are
> disconnected from the snapshot, and the snapshot is deleted. At that
> point you can't tell that anything unconventional happened. If I
> 'read over' a buffer with something else, the pages get associated
> with the new snapshot and disassociated from the old one.
>
> When I do a write call, the write call looks at each page and decides
> whether it is part of a snapshot. If it is, and we're writing back to
> the same part of the same file (an update), and the corresponding
> block has not been changed in the file, then the write call can skip
> that page. In other cases it actually writes to the file. Any other
> writing to the file that we made a snapshot from invokes the
> copy-on-write mechanism, so the file changes but the snapshot
> doesn't.
>
> If you freed the read buffer memory, then parts of it might get
> demand-loaded in the act of writing malloc's book-keeping information
> into it - depending on how the malloc works. If you later use calloc
> (or memset), it will zero the memory, which will detach it all from
> the snapshot, albeit loading every page from the snapshot as it
> goes... One could change calloc to read from /dev/zero for
> allocations over a certain size, and special-case that to set up
> pages for zero-fill when it happens in this type of segment, which
> would disassociate the pages from the old snapshot without loading
> them, just as any other subsequent read does. A memset syscall might
> be better. Practically, though, I think malloc and free are not
> likely to be used in this type of segment.
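The skip test in that write path might reduce to something like this
sketch (illustrative names only; a page is assumed to record which file
region its snapshot was taken from):

```c
#include <assert.h>

/* Hypothetical sketch of the write-back check described above: a page
 * still associated with a snapshot, being written back to the same
 * region of the same file it was snapshotted from, can be skipped if
 * the underlying file block has not changed since the snapshot. */

enum backing { B_SNAPSHOT, B_SWAP };

struct wpage {
	enum backing backing;
	int src_fileid;		/* file the snapshot was taken from */
	long src_off;		/* offset within that file */
	int file_block_dirty;	/* file block changed since snapshot? */
};

/* Returns nonzero if the write call may skip this page. */
static int can_skip_write(const struct wpage *p, int dst_fileid, long dst_off)
{
	return p->backing == B_SNAPSHOT
		&& p->src_fileid == dst_fileid
		&& p->src_off == dst_off
		&& !p->file_block_dirty;
}
```

Every other case - a dirtied page, a different file, a different
offset, or a file block modified since the snapshot - falls through to
an ordinary write.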
> You'd probably just detach the segment rather than free parts of it,
> but I've illustrated how you could drop the deferred snapshot if you
> needed to.
>
> So this is not mmap by another name. It's an optimization of the
> standard read/write approach that has some of the desirable
> characteristics of mmap. In particular: it lets you do an arbitrarily
> large read call instantly, and fault in just the pages you actually
> need, as you need them. So it's like demand-paging, but from a
> snapshot of a file. Similarly, if you're writing back to the same
> file region, write will only write the pages that have altered -
> either in memory or in the file. This is effectively an update,
> somewhat like msync.
>
> It's different from mmap in some ways: the data read is always a copy
> of the file contents, so there's never any spooky changing of memory
> under your feet. The behaviour is not detectably different to the
> program from the traditional implementation - except for where and
> whether the time is spent.
>
> There's still more I could add, but if I'm still not making sense,
> perhaps I'd better stop there. I think I've ended up making it sound
> more complicated than it is.
>
> On Sunday, February 15, 2026, at 10:19 AM, hiro wrote:
>
> > since you give no reasons yourself, let me try to hallucinate a
> > reason why you might be doing what you're doing here:
>
> Here was my example for you:
>
> On Thursday, February 12, 2026, at 1:34 PM, Alyssa M wrote:
>
> > I've built a couple of simple disk file systems. I'm thinking of
> > taking the cache code out of one of them and mapping the whole file
> > system image into the address space - to see how much it simplifies
> > the code. I'm not expecting it will be faster.
>
> This is interesting because it's a large data structure that's very
> sparsely read or written. I'd read the entire file system image into
> the segment in one gulp, respond to some file protocol requests (e.g.
> over 9P) by treating the segment as a single data structure, and
> write the entire image out periodically to implement what we used to
> call 'sync'. With traditional I/O that would be ridiculous. With the
> above mechanism it should work about as well as mmap would - and
> without all that cache code and block fetching, which is the point of
> this.

-- 
Ori Bernstein <[email protected]>

------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-M7294b53b05c66b76159a66b3
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
