Re: [RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem

Darrick J. Wong Thu, 22 Jun 2017 00:24:13 -0700

On Thu, Jun 22, 2017 at 09:37:14AM +1000, Dave Chinner wrote:
> On Mon, Jun 19, 2017 at 10:22:14PM -0700, Darrick J. Wong wrote:
> > [add linux-xfs to the fray]
> > 
> > On Fri, Jun 16, 2017 at 06:15:35PM -0700, Dan Williams wrote:
> > > + spin_lock(&dax_lock);
> > > + list_add(&d->list, &daxfiles);
> > > + spin_unlock(&dax_lock);
> > > +
> > > + /*
> > > +  * We set S_SWAPFILE to gain "no truncate" / static block
> > > +  * allocation semantics, and S_DAXFILE so we can differentiate
> > > +  * traditional swapfiles and assume static block mappings in the
> > > +  * dax mmap path.
> > > +  */
> > > + inode->i_flags |= S_SWAPFILE | S_DAXFILE;
> > 
> > Yikes.  You know, I hadn't even thought about considering swap files as
> > a subcase of files with immutable block maps, but here we are.  Both
> > swap files and DAX require absolutely stable block mappings, they are
> > both (probably) intolerant of inode metadata changes (size, mtime, etc.)
> 
> Swap files are intolerant of any metadata changes because once the
> mapping has been sucked into the swapfile code, the inode is never
> looked at again. DAX file data access always goes through the inode,
> so they is much more tolerant of metadata changes given certain
> constraints.


Fair enough.

> <snip bmap rant>
> 
> > Honestly, I realize we've gone back, forth, and around all over the
> > place on this.  I still prefer something similar to a permanent flag,
> > similar to what Dave suggested, though I hate the name PMEM_IMMUTABLE
> > and some of the semantics.
> > 
> > First, a new inode flag S_IOMAP_FROZEN that means the file's block map
> > cannot change.
> 
> I've been calling it "immutable extents" - freezing has implications
> that it's only temporary (i.e. freezing filesystems) and will be
> followed shortly by a thaw. That isn't the case here - we truly want
> the extent/block map to be immutable....

<nod> S_IOMAP_IMMUTABLE it is, then.

> > Second, some kind of function to toggle the S_IOMAP_FROZEN flag.
> > Turning it on will lock the inode, check the extent map for holes,
> > shared, or unwritten bits, and bail out if it finds any, or set the
> > flag. 
> 
> Hmmm, I disagree on the unwritten state here.  We want swap files to
> be able to use unwritten extents - it means we can preallocate the
> swap file and hand it straight to swapon without having to zero it
> (i.e. no I/O needed to demand allocate more swap space when memory
> is very low).  Also, anyone who tries to read the swap file from
> userspace will be reading unwritten extents, which will always
> return zeros rather than whatever is in the swap file...

Now I've twisted all the way around to thinking that swap files
should be /totally/ unwritten, except for the file header. :)

> > Not sure if we should require CAP_LINUX_IMMUTABLE -- probably
> > yes, at least at first.  I don't currently have any objection to writing
> > non-iomap inode metadata out to disk.
> > 
> > Third, the flag can only be cleared if the file isn't mapped.
> 
> How do we check this from the fs without racing? AFAICT we can't
> prevent a concurrent map operation from occurring while we are
> changing the state of the inode - we can only block page faults
> after then inode is mapped....

I'd thought we could coordinate that via xfs_file_mmap, but tbh my brain
paged that out a while ago.

> > Fourth, the VFS entry points for things like read, write, truncate,
> > utimes, fallocate, etc. all just bail out if S_IOMAP_FROZEN is set on a
> > file, so that the block map cannot be modified.
> > mmap is still allowed,
> > as we've discussed.  /Maybe/ we can allow fallocate to extend a file
> > with zeroed extents (it will be slow) as I've heard murmurs about
> > wanting to be able to extend a file, maybe not.
> 
> read is fine, write should be fine as long as the iomap call can
> error out operations that would require extent map modifications.

Ok.

> fallocate should be allowed to modify the extent map, too, because
> it should be the mechanism used be applications to set up file
> extents in the correct form for applications to use as immutable
> (i.e. lock out page faults, allocate, zero, extend and fsync in
> one atomic operation)....

<nod>

> > Fifth, swapfiles now require the S_IOMAP_FROZEN flag since they want
> > stable iomap but probably don't care about things like mtime.  Maybe
> > they can call iomap too.
> > 
> > Sixth, XFS can record the S_IOMAP_FROZEN state in di_flags2 and set it
> > whenever the in-core inode gets constructed.  This enables us to
> > prohibit reflinking and other such undesirable activity.
> 
> *nod*
> 
> > If we actually /do/ come up with a reference implementation for XFS, I'd
> > be ok with tacking it on the end of my dev branch, which will give us a
> > loooong runway to try it out.  The end of the dev branch is beyond
> > online XFS fsck and repair and the "root metadata btrees in inodes"
> > rework; since that's ~90 patches with my name on it that I cannot also
> > review, it won't go in for a long time indeed!
> 
> I don't think it's so complex to need such a long dev time -
> all the infrastructure we need is pretty much there already...

It definitely is, but we've been bikeshedding so long it now has
momentum. :)

That said, it (and having a go at dax+reflink) are things that I'd like
to look at (after pushing the scrub stuff) for the rest of the year.

> > (Yes, that was also sort of a plea for someone to go review the XFS
> > scrub patches.)
> > 
> > > + return 0;
> > > +}
> > > +
> > > +SYSCALL_DEFINE3(daxctl, const char __user *, path, int, flags, int, 
> > > align)
> > 
> > I was /about/ to grouse about this syscall, then realized that maybe it
> > /is/ useful to be able to check a specific alignment.  Maybe not, since
> > I had something more permanent in mind anyway.  In any case, just pass
> > in an opened fd if this sticks around.
> 
> We can do all that via fallocate(), too...
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> da...@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-api" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem

Reply via email to