[consistent snapshot for local files. Not for remote]

> On Feb 16, 2026, at 11:40 AM, Bakul Shah <[email protected]> wrote:
>
> Note that neither 9p nor nfs provides this guarantee.
> A large read() may be divided up into multiple 9p calls,
> and another client can certainly modify the read data
> range in between 9p reads.
>
> You are asking mmap to do more than what multiple reads
> would do. But if anything, mmap can indeed be implemented
> to provide a consistent snapshot of the file (unlike using
> multiple reads).
>
>> On Feb 15, 2026, at 7:17 PM, Ori Bernstein <[email protected]> wrote:
>>
>> The difficulty here is that having read mark a region
>> as paged in "later" delays the actual I/O, by which
>> time the file contents may have changed, and your
>> read returns incorrect results.
>>
>> This idea can work if your OS has a page cache, the
>> data is already in the page cache, and you eagerly
>> read the data that is not loaded -- but the delayed
>> I/O semantics otherwise simply break.
>>
>> Fixing this would need deep filesystem-level help:
>> the filesystem would need to take a snapshot when
>> the read is invoked, in order to prevent any
>> subsequent mutations from being visible to the reader.
>>
>> (On most Plan 9 file systems, this per-file snapshot
>> is fairly expensive; on gefs, for example, this would
>> snapshot all files within the mount.)
>>
>> On Sun, 15 Feb 2026 21:24:32 -0500
>> "Alyssa M via 9fans" <[email protected]> wrote:
>>
>>> I think the difficulty here is thinking about this as memory mapping. What
>>> I'm really doing is deferred I/O. By the time a read completes, the read
>>> has logically happened; it's just that not all of the data has been
>>> transferred yet. That happens later, as the buffer is examined, and if
>>> pages of the buffer are never examined, it doesn't happen in those pages
>>> at all.
>>>
>>> My implementation (on my hobby OS) only does this in a custom segment type.
>>> A segment of this type can be of any size, but does not have pages
>>> pre-allocated in memory or the swap file. I do this to allow it to be very
>>> large, and because a read has to happen within the boundaries of a segment.
>>> I back it with a file system temporary file, so when pages migrate to the
>>> swap area the disk allocation can be sparse. You can load or store bytes
>>> anywhere in this segment. Touching pages allocates them, first in memory
>>> and eventually in the swap file as they get paged out.
>>>
>>> On Saturday, February 14, 2026, at 2:27 PM, Dan Cross wrote:
>>>> but read/write work in terms of byte buffers that have no obligation to
>>>> be byte aligned. Put another way, read and write relate the contents of
>>>> a "file" with an arbitrarily sized and aligned byte-buffer in memory,
>>>> but there is no obligation that those byte buffers have the properties
>>>> required to be a "page" in the virtual memory sense.
>>>
>>> Understood. My current implementation does conventional I/O with any
>>> fragments of pages at the beginning and end of the read/write buffers, so
>>> small reads and writes happen traditionally. At the moment that's done
>>> before the read completes, so your example of doing lots of adjacent reads
>>> of small areas would work very badly (few pages would get the deferred
>>> loading), but I think I can do better by deferring the fragment I/O, so
>>> adjacent reads can coalesce their snapshots. My main scenario of interest,
>>> though, is very large reads and writes, because that's where the sparse
>>> access has value.
>>>
>>> Because reads are copies and not memory mapping, it doesn't matter if the
>>> reads are not page-aligned. The process's memory pages are not being
>>> shared with the cache of the file (snapshot), so if the data is not
>>> aligned then page faults will copy bytes from two cached file blocks
>>> (assuming they're the same size).
>>> In practice I'm expecting that large reads will be into large
>>> allocations, which will be aligned, so there's an opportunity to steal
>>> blocks from the file cache; but I'm not expecting to implement this.
>>> There's no coherence problem here, because the snapshot is private to the
>>> process. And read-only.
>>>
>>> When I do a read call into the segment, first a snapshot is made of the
>>> data to be read. This is functionally equivalent to making a temporary
>>> file and copying the data into it. Making this copy-on-write, so the
>>> snapshot costs nothing, is a key part of this; without it there would be
>>> no point. The pages of the read buffer in the segment are then associated
>>> with parts of the snapshot, rather than with the swap file. So rather
>>> than zero-filling (or reloading paged-out data) when a load instruction
>>> is executed, the memory pages are filled from the snapshot.
>>> When a store instruction happens, the page becomes dirty and loses its
>>> association with the snapshot; it's then backed by the swap file. If you
>>> alter all pages of the buffer, then all pages are disconnected from the
>>> snapshot, and the snapshot is deleted. At that point you can't tell that
>>> anything unconventional happened.
>>> If I 'read over' a buffer with something else, the pages get associated
>>> with the new snapshot and disassociated from the old one.
>>>
>>> When I do a write call, the write call looks at each page and decides
>>> whether it is part of a snapshot. If it is, and we're writing back to the
>>> same part of the same file (an update), and the corresponding block has
>>> not been changed in the file, then the write call can skip that page. In
>>> other cases it actually writes to the file. Any other writing to the file
>>> that we made a snapshot from invokes the copy-on-write mechanism, so the
>>> file changes, but the snapshot doesn't.
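[Editor's note: to make the page-state transitions described above concrete, here is a minimal, purely illustrative model in Python. The class and method names, and the dict-based "page tables", are all invented for this sketch; a real implementation would live in the kernel's fault handler and page tables, and the copy-on-write of the snapshot itself is not modelled.]

```python
PAGE = 4096

class Snapshot:
    """Copy-on-write view of file blocks, taken the moment read() is issued.
    Here we merely reference the blocks; a real implementation would copy a
    block only when the underlying file is later overwritten."""
    def __init__(self, file_blocks, off):
        self.off = off                   # file offset the snapshot starts at
        self.blocks = dict(file_blocks)  # block index -> bytes

class Segment:
    """Toy model of the custom segment: pages materialize on first touch."""
    def __init__(self, size):
        self.size = size
        self.pages = {}    # page index -> bytearray (resident, maybe dirty)
        self.backing = {}  # page index -> (snapshot, file offset): clean pages

    def read(self, snap, buf_off, length):
        """'Instant' read: no data moves, pages just point at the snapshot."""
        first = buf_off // PAGE
        last = (buf_off + length + PAGE - 1) // PAGE
        for i in range(first, last):
            self.pages.pop(i, None)  # 'reading over' old data drops it
            self.backing[i] = (snap, snap.off + i * PAGE - buf_off)

    def load(self, addr):
        """A load instruction: fault the page in from its snapshot."""
        i = addr // PAGE
        if i not in self.pages:
            snap, foff = self.backing.get(i, (None, 0))
            if snap is None:
                self.pages[i] = bytearray(PAGE)  # zero fill, swap-backed
            else:
                self.pages[i] = bytearray(
                    snap.blocks.get(foff // PAGE, bytes(PAGE)))
        return self.pages[i][addr % PAGE]

    def store(self, addr, val):
        """A store: the page goes dirty and detaches from its snapshot."""
        self.load(addr)                       # fault in first
        self.backing.pop(addr // PAGE, None)  # now swap-backed, not snapshot
        self.pages[addr // PAGE][addr % PAGE] = val

    def write_back(self, out_blocks, buf_off, length):
        """write() as an update: skip pages still clean against a snapshot."""
        written = []
        for i in range(buf_off // PAGE, (buf_off + length) // PAGE):
            if i in self.backing:
                continue                      # unchanged since the read: skip
            out_blocks[i] = bytes(self.pages.get(i, bytes(PAGE)))
            written.append(i)
        return written
```

With this model, an 8 KiB read of two blocks costs nothing up front; the first load from each page faults it in from the snapshot, a store detaches that page, and a later write_back to the same region emits only the detached (dirty) pages.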
>>>
>>> If you freed the read buffer memory, then parts of it might get
>>> demand-loaded in the act of writing malloc's book-keeping information
>>> into it, depending on how the malloc works. If you later use calloc (or
>>> memset), it will zero the memory, which will detach it all from the
>>> snapshot, albeit loading every page from the snapshot as it goes...
>>> One could change calloc to read from /dev/zero for allocations over a
>>> certain size, and special-case that to set up pages for zero-fill when it
>>> happens in this type of segment, which would disassociate the pages from
>>> the old snapshot without loading them, just as any other subsequent read
>>> does. A memset syscall might be better.
>>> Practically, though, I think malloc and free are not likely to be used in
>>> this type of segment. You'd probably just detach the segment rather than
>>> free parts of it, but I've illustrated how you could drop the deferred
>>> snapshot if you needed to.
>>>
>>> So this is not mmap by another name. It's an optimization of the standard
>>> read/write approach that has some of the desirable characteristics of
>>> mmap. In particular, it lets you do an arbitrarily large read call
>>> instantly, and fault in just the pages you actually need, as you need
>>> them. So it's like demand paging, but from a snapshot of a file.
>>> Similarly, if you're writing back to the same file region, write will
>>> only write the pages that have been altered, either in memory or in the
>>> file. This is effectively an update, somewhat like msync.
>>>
>>> It's different from mmap in some ways: the data read is always a copy of
>>> the file contents, so there's never any spooky changing of memory under
>>> your feet. The behaviour, from the program's point of view, is not
>>> detectably different from the traditional implementation, except in where
>>> and whether the time is spent.
>>>
>>> There's still more I could add, but if I'm still not making sense,
>>> perhaps I'd better stop there.
>>> I think I've ended up making it sound more complicated than it is.
>>>
>>> On Sunday, February 15, 2026, at 10:19 AM, hiro wrote:
>>>> since you give no reasons yourself, let me try to hallucinate a reason
>>>> why you might be doing what you're doing here:
>>>
>>> Here was my example for you:
>>>
>>> On Thursday, February 12, 2026, at 1:34 PM, Alyssa M wrote:
>>>> I've built a couple of simple disk file systems. I'm thinking of taking
>>>> the cache code out of one of them and mapping the whole file system
>>>> image into the address space, to see how much it simplifies the code.
>>>> I'm not expecting it will be faster.
>>>
>>> This is interesting because it's a large data structure that's very
>>> sparsely read or written. I'd read the entire file system image into the
>>> segment in one gulp, respond to some file protocol requests (e.g. over
>>> 9P) by treating the segment as a single data structure, and write the
>>> entire image out periodically to implement what we used to call 'sync'.
>>> With traditional I/O that would be ridiculous. With the above mechanism
>>> it should work about as well as mmap would, and without all that cache
>>> code and block fetching. Which is the point of this.
>>
>>
>> --
>> Ori Bernstein <[email protected]>
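[Editor's note: as a back-of-the-envelope illustration of the file-server scenario above, here is a tiny simulation, assuming the deferred-read segment behaves as described. The names (touch, poke, sync) and the dict-based structures are invented stand-ins for the real segment machinery, not anyone's actual API.]

```python
# Toy illustration of the file-server plan: the whole image is 'read'
# into a segment at once, requests touch only a few blocks, and the
# periodic sync writes back only the dirty ones.
BLOCK = 4096
image = {}      # the on-disk image, block index -> bytes (sparse)
resident = {}   # segment pages actually faulted in so far
dirty = set()   # pages modified since the last sync

def touch(block):
    """A load from the segment: fault the block in on demand."""
    if block not in resident:
        resident[block] = image.get(block, b"\0" * BLOCK)
    return resident[block]

def poke(block, data):
    """A store into the segment: the page goes dirty."""
    touch(block)
    resident[block] = data
    dirty.add(block)

def sync():
    """'Write the entire image out': only dirty pages reach the disk."""
    for b in sorted(dirty):
        image[b] = resident[b]
    n = len(dirty)
    dirty.clear()
    return n
```

Serving one request that walks, say, a superblock and an inode block and updates a single data block leaves only three blocks resident out of the whole image, and the next sync writes exactly one block, which is why the one-gulp read plus periodic sync stays cheap despite the image's size.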
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-M6e3e365f5f5c394c0f463fd7
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
