On 12/5/18 9:37 PM, Jeff Mahoney wrote:
> The high level idea that Jan Kara and I came up with in our conversation at Labs conf is pretty expensive.  We'd need to set a flag that pauses new page faults, set the WP bit on affected ranges, do the snapshot, commit, clear the flag, and wake up the waiting threads.  Neither of us had any concrete idea of how well that would perform and it still depends on finding a good way to resolve all open mmap ranges on a subvolume.  Perhaps using the address_space->private_list anchored on each root would work.

This is a potentially wild idea, so "grain of salt" and all that. I may misuse some of the exact terminology.

So the essential problem of DAX is basically the opposite of data deduplication. Instead of merging two duplicate data regions, you want to mark regions as at-risk while keeping the original content intact for any snapshots in conflict.

So suppose you _require_ data checksums and data mode of "dup" or mirror or one of the other fault tolerant layouts.

By definition any block that gets written with content that it didn't have before will now have a bad checksum.

If the inode is flagged for direct IO, that's an indication that the block may have been updated.

At this point you really just need to do the opposite of deduplication: find/recover the original contents and assign them (or leave them assigned) to the old/other snapshots, then compute the new checksum on the "original block" and assign it to the active subvolume.

So when a region is mapped for direct IO, and its refcount is greater than one, and you get to a sync or close event, you "recover" the old contents into a new location and assign those to "all the other users". Now the original storage region has only one user, so on sync or close you can fix its checksums on the cheap.
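A minimal userspace sketch of that recovery flow, assuming a "dup" layout where an mmap store only dirties one replica (this is not btrfs code; Extent, dax_write, and sync_extent are made-up names for illustration):

```python
import zlib

class Extent:
    """A 'dup' extent: two replicas plus the checksum of the committed data."""
    def __init__(self, data: bytes):
        self.copy_a = data           # replica that a DAX mmap store lands on
        self.copy_b = data           # mirror replica, untouched by mmap
        self.csum = zlib.crc32(data)
        self.refcount = 1            # subvolume + snapshots referencing this extent

def dax_write(ext: Extent, new_data: bytes):
    # An mmap store hits one replica directly and bypasses checksum updates.
    ext.copy_a = new_data

def sync_extent(ext: Extent, snapshot_refs: list):
    """On sync/close: detect the stale checksum and slide the old contents
    out from under the snapshots instead of copying the new data."""
    if zlib.crc32(ext.copy_a) == ext.csum:
        return                       # nothing was written; nothing to do
    if ext.refcount > 1:
        # Recover the pre-write contents from the still-valid mirror and
        # hand that copy to "all the other users" (the snapshots).
        assert zlib.crc32(ext.copy_b) == ext.csum
        old = Extent(ext.copy_b)
        old.refcount = ext.refcount - 1
        for snap in snapshot_refs:
            snap['extent'] = old     # snapshots now point at the recovered copy
        ext.refcount = 1
    # Only one user left: recompute the checksum and re-mirror on the cheap.
    ext.csum = zlib.crc32(ext.copy_a)
    ext.copy_b = ext.copy_a
```

So after a snapshot, a DAX write, and a sync, the snapshot still resolves to the old bytes while the active subvolume keeps the mmap'd block in place with a fresh checksum.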

Instead of the new data being a small rock sitting over a large rug to make a lump, the new data is like a rock being slid under the rug to make a lump.

So the first write to an extent creates a burdensome copy to retain the old contents, but second and subsequent writes to the same extent only have the cost of an _eventual_ checksum of the original block list.

Maybe if the data isn't already duplicated, then the write mapping, the DAX open, or the setting of the S_DUP flag could force the file into an extent block that _is_ duplicated.

The mental leap required is that the new blocks don't need to belong to the new state being created. The new blocks can be associated to the snapshots since data copy is idempotent.

The side note is that it only ever matters if the usage count is greater than one, so at worst taking a snapshot, which is already a _little_ racy anyway, would/could trigger a semi-lightweight copy of any S_DAX files:

if S_DAX:
    if checksum invalid:
        copy data as-is, checksum it, and store in snapshot
    else:
        look for a duplicate checksum
        if duplicate found:
            assign that extent to the snapshot
        else:
            if file opened for writing and has any mmaps for write:
                copy extent and assign to the new snapshot
            else:
                increment usage count and assign current block to snapshot
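The decision tree above could be sketched like this (a toy model, not btrfs code; an extent here is just a dict and the function and parameter names are illustrative):

```python
import zlib

def snapshot_dax_extent(ext, dedup_index, writable_mmap):
    """Decide how a snapshot picks up one extent of an S_DAX file.
    ext is {'data': bytes, 'csum': int, 'refcount': int};
    dedup_index maps checksum -> existing extent with those contents."""
    def new_extent(data):
        return {'data': data, 'csum': zlib.crc32(data), 'refcount': 1}

    if zlib.crc32(ext['data']) != ext['csum']:
        # Checksum invalid: copy the data as-is and store it in the snapshot.
        return new_extent(ext['data'])
    dup = dedup_index.get(ext['csum'])
    if dup is not None:
        dup['refcount'] += 1         # duplicate contents exist: share those
        return dup
    if writable_mmap:
        # Live writers could dirty the block under us: give the snapshot a copy.
        return new_extent(ext['data'])
    ext['refcount'] += 1             # quiescent: share the current block
    return ext
```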

Anyway, I only know enough of the internals to be dangerous.

Since the real goal of mmap is speed during the actual update, this idea is basically about amortizing the copy costs into the task of maintaining the snapshots instead of leaving them in the immediate hands of the time-critical updater.

A flush, munmap, or close by the user, or a system-wide sync event, are also good points at which to expense the bookkeeping time.
