[consistent snapshot for local files. Not for remote]

> On Feb 16, 2026, at 11:40 AM, Bakul Shah <[email protected]> wrote:
> 
> Note that neither 9p nor nfs provide this guarantee.
> A large read() may be divided up in multiple 9p calls
> and another client can certainly modify the read data
> range in between 9p reads.
> 
> You are asking mmap to do more than what multiple reads
> would do. But if anything, mmap can indeed be implemented
> to provide a consistent snapshot (unlike using multiple
> reads).
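
A minimal sketch of the torn read described above, in Python for brevity. It is a toy model, not 9P: a bytearray stands in for the file on the server, and MSIZE, chunked_read, snapshot_read and writer are hypothetical names chosen for illustration.

```python
# Toy model only: a bytearray stands in for the file on the server.
MSIZE = 4  # bytes per request (hypothetical; a real 9P msize is much larger)

def chunked_read(file, offset, count, between=None):
    """Read count bytes as several requests; run between() after each one."""
    out = bytearray()
    while len(out) < count:
        pos = offset + len(out)
        n = min(MSIZE, count - len(out))
        chunk = file[pos:pos + n]
        if not chunk:
            break  # end of file
        out += chunk
        if between is not None:
            between()  # another client gets a chance to run here
    return bytes(out)

def snapshot_read(file, offset, count):
    """Copy the whole range in one step, so no writer can interleave."""
    return bytes(file[offset:offset + count])

file = bytearray(b"AAAAAAAA")

def writer():
    file[:] = b"BBBBBBBB"  # overwrite the range being read

torn = chunked_read(file, 0, 8, between=writer)
print(torn)  # b'AAAABBBB': first request saw the old data, second the new
```

In this model snapshot_read cannot be torn only because nothing runs between its start and end; a real implementation would need help from the file system to get the same effect.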
> 
>> On Feb 15, 2026, at 7:17 PM, Ori Bernstein <[email protected]> wrote:
>> 
>> The difficulty here is that having read mark a region
>> as paged in "later" delays the actual I/O, by which
>> time the file contents may have changed, and your
>> read returns incorrect results.
>> 
>> This idea can work if your OS has a page cache, the
>> data is already in the page cache, and you eagerly
>> read the data that is not loaded -- but the delayed
>> i/o semantics otherwise simply break.
>> 
>> Fixing this would need deep filesystem-level help,
>> where the filesystem would need to take a snapshot
>> when the read is invoked, in order to prevent any
>> subsequent mutations from being visible to the reader.
>> 
>> (on most Plan 9 file systems, this per-file snapshot
>> is fairly expensive; on gefs, for example, this would
>> snapshot all files within the mount)
>> 
>> On Sun, 15 Feb 2026 21:24:32 -0500
>> "Alyssa M via 9fans" <[email protected]> wrote:
>> 
>>> I think the difficulty here is thinking about this as memory mapping. What 
>>> I'm really doing is deferred I/O. By the time a read completes, the read 
>>> has logically happened, it's just that not all of the data has been 
>>> transferred yet.
>>> That happens later as the buffer is examined, and if pages of the buffer 
>>> are not examined, it doesn't happen in those pages at all. 
>>> 
>>> My implementation (on my hobby OS) only does this in a custom segment type. 
>>> A segment of this type can be of any size, but no pages are pre-allocated 
>>> for it in memory or the swap file - I do this to allow it to be very large, and 
>>> because a read has to happen within the boundaries of a segment. I back it 
>>> with a file system temporary file, so when pages migrate to the swap area 
>>> the disk allocation can be sparse. You can load or store bytes anywhere in 
>>> this segment. Touching pages allocates them, first in memory and eventually 
>>> in the swap file as they get paged out.
>>> 
>>> On Saturday, February 14, 2026, at 2:27 PM, Dan Cross wrote:
>>>> but read/write work in terms of byte buffers that have no obligation to be
>>>> byte aligned. Put another way, read and write relate the contents of a
>>>> "file" with an arbitrarily sized and aligned byte-buffer in memory,
>>>> but there is no obligation that those byte buffers have the properties
>>>> required to be a "page" in the virtual memory sense.
>>>
>>> Understood. My current implementation does conventional I/O with any 
>>> fragments of pages at the beginning and end of the read/write buffers. So 
>>> small reads and writes happen traditionally. At the moment that's done 
>>> before the read completes, so your example of doing lots of adjacent reads 
>>> of small areas would work very badly (few pages would get the deferred 
>>> loading), but I think I can do better by deferring the fragment I/O, so 
>>> adjacent reads can coalesce the snapshots. My main scenario of interest 
>>> though is for very large reads and writes, because that's where the sparse 
>>> access has value.
>>> 
>>> Because reads are copies and not memory mapping, it doesn't matter if the 
>>> reads are not page-aligned. The process's memory pages are not being shared 
>>> with the cache of the file (snapshot), so if the data is not aligned then 
>>> page faults will copy bytes from two cached file blocks (assuming they're 
>>> the same size). In practice I'm expecting that large reads will be into 
>>> large allocations, which will be aligned, so there's an opportunity to 
>>> steal blocks from the file cache. But I'm not expecting to implement this. 
>>> There's no coherence problem here because the snapshot is private to the 
>>> process. And readonly.
>>> 
>>> When I do a read call into the segment, firstly a snapshot is made of the 
>>> data to be read. This is functionally equivalent to making a temporary file 
>>> and copying the data into it. Making this copy-on-write so the snapshot 
>>> costs nothing is a key part of this, without which there would be no point.
>>> The pages of the read buffer in the segment are then associated with parts 
>>> of the snapshot - rather than the swap file. So rather than zero filling 
>>> (or reloading paged-out data) when a load instruction is executed, the 
>>> memory pages are filled from the snapshot.
>>> When a store instruction happens, the page becomes dirty, and loses its 
>>> association with the snapshot. It's then backed by the swap file. If you 
>>> alter all pages of the buffer, then all pages are disconnected from the 
>>> snapshot, and the snapshot is deleted. At that point you can't tell that 
>>> anything unconventional happened.
>>> If I 'read over' a buffer with something else, the pages get associated 
>>> with the new snapshot, and disassociated from the old one.
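
A minimal sketch of the page bookkeeping described above, as a toy model in Python. Segment, PAGE and every method name are hypothetical and are not taken from the implementation under discussion. bytes(file) stands in for a copy-on-write snapshot, which a real system would take without copying anything.

```python
PAGE = 4  # toy page size in bytes (hypothetical)

class Segment:
    """Toy segment whose reads are deferred until pages are touched."""

    def __init__(self):
        self.pages = {}    # page number -> bytearray (pages in memory)
        self.backing = {}  # page number -> (snapshot, offset in snapshot)
        self.faults = 0

    def read(self, file, page, npages):
        """Associate npages pages, starting at page, with the start of file."""
        snap = bytes(file)  # stand-in for a copy-on-write snapshot
        for i in range(npages):
            self.pages.pop(page + i, None)  # 'read over' old contents
            self.backing[page + i] = (snap, i * PAGE)

    def _fault(self, page):
        if page in self.pages:
            return
        self.faults += 1
        if page in self.backing:
            snap, off = self.backing[page]
            data = bytearray(snap[off:off + PAGE].ljust(PAGE, b"\0"))
        else:
            data = bytearray(PAGE)  # ordinary zero fill
        self.pages[page] = data

    def load(self, addr):
        self._fault(addr // PAGE)
        return self.pages[addr // PAGE][addr % PAGE]

    def store(self, addr, value):
        page = addr // PAGE
        self._fault(page)
        self.pages[page][addr % PAGE] = value
        self.backing.pop(page, None)  # dirty: detach from the snapshot

file = bytearray(b"abcdefgh")
seg = Segment()
seg.read(file, 0, 2)      # returns at once; no page has been filled yet
file[4:8] = b"WXYZ"       # a later writer changes the file
print(chr(seg.load(4)))   # e: page 1 is filled from the snapshot
print(seg.faults)         # 1: page 0 was never touched
```

Once every page has been stored to, nothing refers to the snapshot any more and Python's garbage collector reclaims it, which plays the role of deleting the snapshot.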
>>> 
>>> When I do a write call, the write call looks at each page, and decides 
>>> whether it is part of a snapshot. If it is, and we're writing back to the 
>>> same part of the same file (an update) and the corresponding block has not 
>>> been changed in the file, then the write call can skip that page. In other 
>>> cases it actually writes to the file. Any other writing to the file that we 
>>> made a snapshot from invokes the copy-on-write mechanism, so the file 
>>> changes, but the snapshot doesn't.
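
A minimal sketch of the write-back decision described above, again as a toy model. write_back and its parameters are hypothetical. The sketch compares block contents to decide whether the file has changed; that is an assumption made for brevity, and a real implementation would more likely compare block identity or a version number.

```python
PAGE = 4  # toy page size in bytes (hypothetical)

def write_back(file, snapshot, pages, dirty):
    """Write a buffer back to the file region it was read from.

    file     -- bytearray, current file contents
    snapshot -- bytes, file contents at the time of the read
    pages    -- dict: page number -> bytes, pages materialised in memory
    dirty    -- set of page numbers stored to since the read
    Returns the list of page numbers actually written.
    """
    written = []
    for p in range(len(snapshot) // PAGE):
        lo, hi = p * PAGE, (p + 1) * PAGE
        unchanged_in_file = file[lo:hi] == snapshot[lo:hi]
        if p not in dirty and unchanged_in_file:
            continue  # page still matches the file: nothing to write
        file[lo:hi] = pages[p] if p in pages else snapshot[lo:hi]
        written.append(p)
    return written

file = bytearray(b"aaaabbbbcccc")
snapshot = bytes(file)    # taken when the read was issued
file[8:12] = b"zzzz"      # someone else changes page 2 of the file
written = write_back(file, snapshot, {1: b"BBBB"}, {1})
print(written)            # [1, 2]: page 0 is skipped
print(bytes(file))        # b'aaaaBBBBcccc'
```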
>>> 
>>> If you freed the read buffer memory, then parts of it might get demand 
>>> loaded in the act of writing malloc's book-keeping information into it - 
>>> depending on how the malloc works. If you later use calloc (or memset), it 
>>> will zero the memory, which will detach it all from the snapshot, albeit 
>>> loading every page from the snapshot as it goes...
>>> One could change calloc to read from /dev/zero for allocations over a 
>>> certain size, and special-case that to set up pages for zero-fill when it 
>>> happens in this type of segment, which would disassociate the pages from 
>>> the old snapshot without loading them, just as any other subsequent read 
>>> does. A memset syscall might be better. 
>>> Practically, though, I think malloc and free are not likely to be used in 
>>> this type of segment. You'd probably just detach the segment rather than 
>>> free parts of it, but I've illustrated how you could drop the deferred 
>>> snapshot if you needed to.
>>> 
>>> So this is not mmap by another name. It's an optimization of the standard 
>>> read/write approach that has some of the desirable characteristics of mmap. 
>>> In particular: it lets you do an arbitrarily large read call instantly, and 
>>> fault in just the pages you actually need as you need them. So like 
>>> demand-paging, but from a snapshot of a file. Similarly, if you're writing 
>>> back to the same file region, write will only write the pages that have 
>>> been altered - either in memory or in the file. This is effectively an update, 
>>> somewhat like msync. 
>>> 
>>> It's different from mmap in some ways: the data read is always a copy of 
>>> the file contents, so there's never any spooky changing of memory under 
>>> your feet. The behaviour is not detectably different to the program from 
>>> the traditional implementation - except for where and if the time is spent.
>>> 
>>> There's still more I could add, but if I'm still not making sense, perhaps 
>>> I'd better stop there. I think I've ended up making it sound more 
>>> complicated than it is. 
>>> 
>>> On Sunday, February 15, 2026, at 10:19 AM, hiro wrote:
>>>> since you give no reasons yourself, let me try to hallucinate a reason
>>>> why you might be doing what you're doing here:
>>> 
>>> Here was my example for you:
>>> 
>>> On Thursday, February 12, 2026, at 1:34 PM, Alyssa M wrote:
>>>> I've built a couple of simple disk file systems. I'm thinking of taking the 
>>>> cache code out of one of them and mapping the whole file system image into 
>>>> the address space - to see how much it simplifies the code. I'm not 
>>>> expecting it will be faster.
>>> 
>>> This is interesting because it's a large data structure that's very 
>>> sparsely read or written. I'd read the entire file system image into the 
>>> segment in one gulp, respond to some file protocol requests (e.g. over 9P) 
>>> by treating the segment as a single data structure, and write the entire 
>>> image out periodically to implement what we used to call 'sync'.
>>> With traditional I/O that would be ridiculous. With the above mechanism it 
>>> should work about as well as mmap would. And without all that cache code 
>>> and block fetching. Which is the point of this.
>> 
>> 
>> --
>> Ori Bernstein <[email protected]>

------------------------------------------
9fans: 9fans
Permalink: 
https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-M6e3e365f5f5c394c0f463fd7