On Wed, Feb 5, 2014 at 12:04 PM, Glauber Costa <[email protected] > wrote:
> On Wed, Feb 5, 2014 at 10:03 PM, Matthew Ahrens <[email protected]> wrote:
>
>> On Wed, Feb 5, 2014 at 2:31 AM, Glauber Costa <[email protected]> wrote:
>>
>>> Hi
>>>
>>> I've been recently trying to devise some mechanism to reuse the ARC
>>> buffers directly in a file-backed mapping (created by mmap, for
>>> instance). My main goal is not to have duplication between what is in
>>> the ARC and what is in the page cache (and to be honest, the OS I am
>>> working on does not have a page cache, so my real goal is to keep it
>>> this way).
>>>
>>> It seems like Solaris and BSD never did that, but I could not find any
>>> indication of why.
>>> That's the kind of thing I am pretty sure was thought about before, so I
>>> wonder if the lack of an implementation like that is due to a major
>>> showstopper found by you guys.
>>>
>>> So before I dive too deeply into this, can anybody advise me on this?
>>>
>>> Thanks
>>
>> As I recall, the main reason we kept the ZFS cache separate from the page
>> cache was to avoid complexity related to the different locking models. If
>> you are designing mmap from scratch, I imagine you could avoid that.
>>
>
> Most definitely. This does not exist yet for me outside of a sheet of
> paper, but as I am evolving it, one of the main problems I am foreseeing
> is that every ARC access is intermediated by a read or write operation
> that can call arc_access at a very well defined point. If this buffer is
> used outside of the ARC, those accesses won't exist. Especially for mmap,
> that information is held in the processor accessed / dirty bits (for the
> case of moving to anonymous), and periodically synchronizing the whole
> memory can be prohibitive - although I am talking dozens of GBs of memory
> here, hundreds is a bit out of our scope.
>

I was imagining that the dbuf is held as long as the page is mapped, and
thus the arc_buf is not evictable. So you wouldn't need to worry about the
accessed bits.
Maybe just call arc_access() when it is unmapped. Handling the dirty bit is
more complicated.

>
> Did you guys give this any previous thought?
>
> My proposed solution so far is, every time a new page is inserted into
> the cache, to verify the bits in the page that it would dislodge and
> update accordingly if needed. The same goes for eviction, especially
> eviction to the ghost list.
>
>> Read-only mmap should be relatively straightforward. When the page is
>> faulted in you can just keep the dbuf (dmu_buf_impl_t) held, so that it
>> stays in memory, and then find its page_t and map it into the process's
>> address space.
>>
>
> Actually, I don't want it to stay in memory. One of the big wins for me in
> doing this is to be able to re-use ZFS's paging policy instead of
> implementing our own. We only ever want to do paging for file-backed
> shared mappings, so we have no other kind of paging. What I intend to do
> is to have ZFS tell the OS when the page is about to be taken out of the
> cache, and then allow me to get rid of the present bit. But that still
> seems doable given what you said, as long as I do all those updates with
> the buffer still held (doesn't seem a problem).
>
>> Write-back mmap (i.e. PROT_WRITE + MAP_SHARED) will be trickier, because
>> you can't modify the page while the dbuf is being written (due to
>> checksums, raid-z, etc). Nor can you have transactions of indefinite
>> length (e.g. create transaction when page is first stored to, commit it
>> when pageout gets around to flushing it). I guess you could do something
>> like mark the dbuf dirty and then when syncing context (dbuf_sync_leaf())
>> gets around to writing it, copy the data from the page to a new arc_buf
>> that's used only while writing it out.
>>
>
> As a last resort I can COW on write and implement a writeback mechanism
> that allows me to simplify it to keep the mapping with clean pages only
> (re-read them upon writeback completion). But that would be less
> desirable.
>
> Is that limitation about not being able to modify the dbuf only valid for
> the period in which IO is initiated? That seems possible to overcome as
> well, although we'd be already out of the trivial zone.
>

I think you would want to do something like: in dbuf_sync_leaf(), mark the
page clean, then copy the data from the page to a new arc_buf, and send that
arc_buf down to zio_write(). When the i/o is done you can free the arc_buf.
This could be several seconds, but you should be able to continue modifying
the page while this is happening (because we made the copy).

--matt
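
For illustration only, the copy-at-sync idea above could be sketched in
user-space C roughly as follows. This is not ZFS code: struct mapped_page,
struct write_snapshot, page_store(), snapshot_for_write() and write_done()
are all hypothetical stand-ins (for the mapped page, the temporary arc_buf,
a store through the mapping, the step inside dbuf_sync_leaf(), and the i/o
completion callback respectively), and the dirty flag is a plain bool
rather than a hardware bit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096

/* Hypothetical stand-in for a page that is mapped into a process. */
struct mapped_page {
    unsigned char data[PAGE_SIZE];
    bool dirty;                 /* stand-in for the hardware dirty bit */
};

/* Hypothetical stand-in for the temporary buffer sent to the write path. */
struct write_snapshot {
    unsigned char *data;
    size_t len;
};

/* A store through the mapping: changes the page and sets the dirty flag. */
static void page_store(struct mapped_page *pg, size_t off, unsigned char v)
{
    assert(off < PAGE_SIZE);
    pg->data[off] = v;
    pg->dirty = true;
}

/*
 * Sync-time step: mark the page clean first, then copy its contents into a
 * private buffer. The write path only ever sees the copy, so the page can
 * keep changing while the write is in flight. Clearing the flag before the
 * copy means a later store marks the page dirty again for the next sync.
 * Returns NULL on allocation failure, leaving the page state untouched.
 */
static struct write_snapshot *snapshot_for_write(struct mapped_page *pg)
{
    struct write_snapshot *snap = malloc(sizeof(*snap));
    if (snap == NULL)
        return NULL;
    snap->data = malloc(PAGE_SIZE);
    if (snap->data == NULL) {
        free(snap);
        return NULL;
    }
    pg->dirty = false;
    memcpy(snap->data, pg->data, PAGE_SIZE);
    snap->len = PAGE_SIZE;
    return snap;
}

/* Write completion: the temporary copy is no longer needed. */
static void write_done(struct write_snapshot *snap)
{
    assert(snap != NULL);
    free(snap->data);
    free(snap);
}
```

A real implementation would also have to make the "mark clean, then copy"
step safe against concurrent stores from other CPUs; the sketch ignores
that and only shows the ordering and the lifetime of the temporary copy.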
_______________________________________________
developer mailing list
[email protected]
http://lists.open-zfs.org/mailman/listinfo/developer
