The difficulty here is that having read mark a region
as "to be paged in later" defers the actual I/O. By
the time the transfer happens, the file contents may
have changed, and your read returns incorrect results.

This idea can work if your OS has a page cache, the
data is already in the page cache, and you eagerly
read the data that is not loaded -- but the delayed
I/O semantics otherwise simply break.
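
A toy sketch of that failure, assuming a naive deferred
read that only records what to fetch and does the
transfer on first use. DeferredRead and demo are
hypothetical names made up for illustration, not any
real kernel interface:

```python
import os
import tempfile


def eager_read(path, offset, length):
    """Conventional read: the bytes are copied before the call returns."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


class DeferredRead:
    """Hypothetical deferred read: remembers the request, transfers on first use."""

    def __init__(self, path, offset, length):
        self.path, self.offset, self.length = path, offset, length
        self._data = None

    def bytes(self):
        if self._data is None:
            # The I/O happens here, not when the "read" was issued.
            self._data = eager_read(self.path, self.offset, self.length)
        return self._data


def demo():
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"old contents")
        os.close(fd)
        eager = eager_read(path, 0, 12)      # data transferred now
        lazy = DeferredRead(path, 0, 12)     # nothing transferred yet
        with open(path, "r+b") as f:         # another writer changes the file
            f.write(b"NEW contents")
        return eager, lazy.bytes()           # lazy sees the later write
    finally:
        os.unlink(path)
```

Both reads were issued before the write, but only the
eager one returns what the file held at that moment.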

Fixing this would need deep filesystem-level help,
where the filesystem would need to take a snapshot
when the read is invoked, in order to prevent any
subsequent mutations from being visible to the reader.
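
As a rough model of what that snapshot has to provide,
here is a sketch that assumes an in-memory "file" made
of immutable blocks. ToyFile and deferred_read are
hypothetical names; a real file system would have to do
the equivalent on disk:

```python
class ToyFile:
    """In-memory stand-in for a file stored as fixed-size blocks."""
    BLOCK = 4

    def __init__(self, data):
        self.blocks = [data[i:i + self.BLOCK]
                       for i in range(0, len(data), self.BLOCK)]

    def snapshot(self):
        # Copies block references only; the bytes objects are immutable
        # and stay shared until a writer replaces one.
        return tuple(self.blocks)

    def write_block(self, n, data):
        # Copy-on-write: install a new block in the live file and leave
        # the old block to any snapshot that still refers to it.
        self.blocks[n] = data


def deferred_read(f):
    """Take the snapshot when read is invoked; hand out blocks only when asked."""
    snap = f.snapshot()
    return lambda n: snap[n]
```

A write that lands after deferred_read returns changes
the live file but not the blocks the reader later
faults in.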

(on most Plan 9 file systems, this per-file snapshot
is fairly expensive; on gefs, for example, this would
snapshot all files within the mount)

On Sun, 15 Feb 2026 21:24:32 -0500
"Alyssa M via 9fans" <[email protected]> wrote:

> I think the difficulty here is thinking about this as memory mapping. What 
> I'm really doing is deferred I/O. By the time a read completes, the read has 
> logically happened, it's just that not all of the data has been transferred 
> yet.
> That happens later as the buffer is examined, and if pages of the buffer are 
> not examined, it doesn't happen in those pages at all. 
> 
> My implementation (on my hobby OS) only does this in a custom segment type. A 
> segment of this type can be of any size, but is not pre-allocated pages in 
> memory or the swap file - I do this to allow it to be very large, and because 
> a read has to happen within the boundaries of a segment. I back it with a 
> file system temporary file, so when pages migrate to the swap area the disk 
> allocation can be sparse. You can load or store bytes anywhere in this 
> segment. Touching pages allocates them, first in memory and eventually in the 
> swap file as they get paged out.
> 
> On Saturday, February 14, 2026, at 2:27 PM, Dan Cross wrote:
> > but read/write work in terms of byte buffers that have no obligation
> > to be byte aligned. Put another way, read and write relate the contents
> > of a "file" with an arbitrarily sized and aligned byte-buffer in memory,
> > but there is no obligation that those byte buffers have the properties
> > required to be a "page" in the virtual memory sense.
> 
> Understood. My current implementation does conventional I/O with any 
> fragments of pages at the beginning and end of the read/write buffers. So 
> small reads and writes happen traditionally. At the moment that's done before 
> the read completes, so your example of doing lots of adjacent reads of small 
> areas would work very badly (few pages would get the deferred loading), but I 
> think I can do better by deferring the fragment I/O, so adjacent reads can 
> coalesce the snapshots. My main scenario of interest though is for very large 
> reads and writes, because that's where the sparse access has value.
> 
> Because reads are copies and not memory mapping, it doesn't matter if the 
> reads are not page-aligned. The process's memory pages are not being shared 
> with the cache of the file (snapshot), so if the data is not aligned then 
> page faults will copy bytes from two cached file blocks (assuming they're the 
> same size). In practice I'm expecting that large reads will be into large 
> allocations, which will be aligned, so there's an opportunity to steal blocks 
> from the file cache. But I'm not expecting to implement this. There's no 
> coherence problem here because the snapshot is private to the process. And 
> readonly.
> 
> When I do a read call into the segment, firstly a snapshot is made of the 
> data to be read. This is functionally equivalent to making a temporary file 
> and copying the data into it. Making this copy-on-write so the snapshot costs 
> nothing is a key part of this without which there would be no point.
> The pages of the read buffer in the segment are then associated with parts of 
> the snapshot - rather than the swap file. So rather than zero filling (or 
> reloading paged-out data) when a load instruction is executed, the memory 
> pages are filled from the snapshot.
> When a store instruction happens, the page becomes dirty, and loses its 
> association with the snapshot. It's then backed by the swap file. If you 
> alter all pages of the buffer, then all pages are disconnected from the 
> snapshot, and the snapshot is deleted. At that point you can't tell that 
> anything unconventional happened.
> If I 'read over' a buffer with something else, the pages get associated with 
> the new snapshot, and disassociated from the old one.
> 
> When I do a write call, the write call looks at each page, and decides 
> whether it is part of a snapshot. If it is, and we're writing back to the 
> same part of the same file (an update) and the corresponding block has not 
> been changed in the file, then the write call can skip that page. In other 
> cases it actually writes to the file. Any other writing to the file that we 
> made a snapshot from invokes the copy-on-write mechanism, so the file 
> changes, but the snapshot doesn't.
> 
> If you freed the read buffer memory, then parts of it might get demand loaded 
> in the act of writing malloc's book-keeping information into it - depending 
> on how the malloc works. If you later use calloc (or memset), it will zero 
> the memory, which will detach it all from the snapshot, albeit loading every 
> page from the snapshot as it goes...
> One could change calloc to read from /dev/zero for allocations over a certain 
> size, and special-case that to set up pages for zero-fill when it happens in 
> this type of segment, which would disassociate the pages from the old 
> snapshot without loading them, just as any other subsequent read does. A 
> memset syscall might be better. 
> Practically, though, I think malloc and free are not likely to be used in 
> this type of segment. You'd probably just detach the segment rather than free 
> parts of it, but I've illustrated how you could drop the deferred snapshot if 
> you needed to.
> 
> So this is not mmap by another name. It's an optimization of the standard 
> read/write approach that has some of the desirable characteristics of mmap. 
> In particular: it lets you do an arbitrarily large read call instantly, and 
> fault in just the pages you actually need as you need them. So like 
> demand-paging, but from a snapshot of a file. Similarly, if you're writing 
> back to the same file region, write will only write the pages that have 
> altered - either in memory or in the file. This is effectively an update, 
> somewhat like msync. 
> 
> It's different from mmap in some ways: the data read is always a copy of the 
> file contents, so there's never any spooky changing of memory under your 
> feet. The behaviour is not detectably different to the program from the 
> traditional implementation - except for where and if the time is spent.
> 
> There's still more I could add, but if I'm still not making sense, perhaps 
> I'd better stop there. I think I've ended up making it sound more complicated 
> than it is. 
> 
> On Sunday, February 15, 2026, at 10:19 AM, hiro wrote:
> > since you give no reasons yourself, let me try to hallucinate a reason
> > why you might be doing what you're doing here:
> 
> Here was my example for you:
> 
> On Thursday, February 12, 2026, at 1:34 PM, Alyssa M wrote:
> > I've built a couple of simple disk file systems. I'm thinking of taking the 
> > cache code out of one of them and mapping the whole file system image into 
> > the address space - to see how much it simplifies the code. I'm not 
> > expecting it will be faster.
> 
> This is interesting because it's a large data structure that's very sparsely 
> read or written. I'd read the entire file system image into the segment in 
> one gulp, respond to some file protocol requests (e.g. over 9P) by treating 
> the segment as a single data structure, and write the entire image out 
> periodically to implement what we used to call 'sync'.
> With traditional I/O that would be ridiculous. With the above mechanism it 
> should work about as well as mmap would. And without all that cache code and 
> block fetching. Which is the point of this.


-- 
Ori Bernstein <[email protected]>

------------------------------------------
9fans: 9fans
Permalink: 
https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-M7294b53b05c66b76159a66b3