Thanks Joe, great writeup.  Maybe this should go in thin-provisioning.txt.

More below:

On Wed, 4 Dec 2019, Joe Thornber wrote:

> (These notes are for my own benefit as much as anything, I haven't
> worked on this for a couple of years and will forget it all completely
> if I don't write it down somewhere).
> 
> Let's start by writing some pseudocode for what the remap function for
> thin provisioning actually does.
> 
>       ----------------------------------------------------------
>       -- Metadata
> 
>       newtype ThinId = ThinId Int
>       data Bio = Bio
>       newtype VBlock = VBlock Integer -- virtual block
>       newtype DBlock = DBlock Integer -- data block

Can you define virtual block vs data block?  Is this just the thin volume 
offset vs the tdata offset?

> 
>       data LookupResult =
>           Unprovisioned |
>           Provisioned { lrDBlock :: DBlock,
>                         lrShared :: Bool
>                       }

What is "lr"? as in lrDBlock?

>       metadataLookup :: ThinId -> VBlock -> Task LookupResult
>       metadataLookup = undefined
> 
>       metadataInsert :: ThinId -> VBlock -> DBlock -> Task ()
>       metadataInsert = undefined
> 
>       metadataRemove :: ThinId -> VBlock -> Task ()
>       metadataRemove = undefined
> 
>       -- Blocks all other metadata operations while running
>       metadataCommit :: Task ()
>       metadataCommit = undefined
> 
>       ----------------------------------------------------------
>       -- Tasks
> 
>       -- Many steps of servicing a bio can block.  eg, taking a lock,
>       -- reading metadata, updating metadata, zeroing data, copying
>       -- data ...
>       -- So we completely sidestep the issue in this pseudocode by
>       -- running everything in a magic light weight thread.
>       spawn :: Task () -> IO ()
>       spawn = undefined
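
For anyone who wants to pin the Task idea down, here's a minimal
sketch, purely for reading the pseudocode: Task-as-IO and forkIO are
my stand-ins, not anything in the target.

        import Control.Concurrent (forkIO)
        import Control.Monad (void)

        -- Model Task as plain IO, and the "magic light weight
        -- thread" as forkIO.  In the kernel the real mechanism is
        -- workqueues plus completion functions.
        type Task = IO

        spawn :: Task () -> IO ()
        spawn task = void (forkIO task)
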
> 
>       ----------------------------------------------------------
>       -- Locking
> 
>       -- These 'with' primitives acquire a lock (can block of course), perform
>       -- an action, and then automatically release the lock.
>        
>       -- Shared lock can be upgraded, so we need to pass the lock into
>       -- the action body.
>       withSharedLock :: ThinId -> VBlock -> (Lock -> Task ()) -> Task ()
>       withSharedLock thinId vblock actionFn = undefined
> 
>       withExclusiveLock :: ThinId -> VBlock -> Task () -> Task ()
>       withExclusiveLock thinId vblock action = undefined
> 
>       -- This promotes a shared lock to exclusive.
>       withUpgradedLock :: Lock -> Task () -> Task ()
>       withUpgradedLock lock action = undefined
> 
>       -- Data locks are always exclusive
>       withDataLock :: DBlock -> Task () -> Task ()
>       withDataLock dblock action = undefined
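
Presumably these 'with' primitives bracket acquire/release so the
lock is dropped even if the action fails.  A sketch of the shape
(MVar-based and exclusive-only, my own stand-in; the real locks are
bio_prison cells, and shared/upgradable locks need a richer
structure):

        import Control.Concurrent.MVar (MVar, takeMVar, putMVar)
        import Control.Exception (bracket)

        newtype ExclLock = ExclLock (MVar ())

        withLock :: ExclLock -> IO a -> IO a
        withLock (ExclLock m) action =
            bracket (takeMVar m)           -- acquire (blocks)
                    (\_ -> putMVar m ())   -- release, even on error
                    (\_ -> action)
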
> 
>       ----------------------------------------------------------
> 
>       -- Top level remap function.  Kicks off a green thread for each bio.
>       -- How we handle a bio depends on whether it's a read, write, discard
>       -- or flush bio; whether the block is already provisioned; and, if so,
>       -- whether it is shared between snapshots.
>       remap :: ThinId -> Bio -> IO ()
>       remap thinId bio = spawn $ remapFn thinId bio vblock
>           where
>               vblock = virtBlock bio
>               remapFn = case classifyBio bio of
>                   ReadBio -> remapRead
>                   WriteBio -> remapWrite
>                   DiscardBio -> remapDiscard
>                   FlushBio -> remapFlush
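
classifyBio and virtBlock are used but never declared; presumably
they look something like this (BioType is my name, not Joe's):

        data BioType = ReadBio | WriteBio | DiscardBio | FlushBio

        classifyBio :: Bio -> BioType
        classifyBio = undefined     -- inspect the bio's op/flags

        virtBlock :: Bio -> VBlock
        virtBlock = undefined       -- bio sector divided by the block size
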
> 
>       ----------------------------------------------------------
> 
>       remapRead :: ThinId -> Bio -> VBlock -> Task ()
>       remapRead thinId bio vblock = do
>           withSharedLock thinId vblock $ \_ -> do
>               lr <- metadataLookup thinId vblock
>               case lr of
>                   -- Read, Unprovisioned, Shared/!Shared
>                   Unprovisioned -> do
>                       fillWithZeroes bio
>                       complete bio Success
> 
>                   -- Read, Provisioned, Shared/!Shared
>                   (Provisioned dblock _) ->
>                       remapAndWait bio dblock
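
(fillWithZeroes and complete are also used without declarations;
stubs in the same style, with BioResult my name for the
Success/Failure type:)

        data BioResult = Success | Failure

        fillWithZeroes :: Bio -> Task ()
        fillWithZeroes = undefined

        complete :: Bio -> BioResult -> Task ()
        complete = undefined
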
> 
>       ----------------------------------------------------------
> 
>       remapWrite :: ThinId -> Bio -> VBlock -> Task ()
>       remapWrite thinId bio vblock = do
>           withSharedLock thinId vblock $ \lock -> do
>               lr <- metadataLookup thinId vblock
>               case lr of
>                   -- Write, Unprovisioned
>                   Unprovisioned ->
>                       withUpgradedLock lock $
>                           provision thinId bio vblock
> 
>                   -- Write, Provisioned, !Shared
>                   (Provisioned dblock False) ->
>                       remapAndWait bio dblock
> 
>                   -- Write, Provisioned, Shared
>                   (Provisioned dblock True) ->
>                       withUpgradedLock lock $
>                           breakSharing thinId bio vblock dblock
> 
>       breakSharing :: ThinId -> Bio -> VBlock -> DBlock -> Task ()
>       breakSharing thinId bio vblock dblockOld = do
>           ab <- allocateBlock
>           case ab of
>               NoDataSpace ->
>                   complete bio Failure
> 
>               (Allocated dblockNew) ->
>                   -- we grab data locks to avoid races with discard
>                   withDataLock dblockOld $
>                       withDataLock dblockNew $ do
>                           copy dblockOld dblockNew
>                           metadataInsert thinId vblock dblockNew
>                           remapAndWait bio dblockNew
> 
>       provision :: ThinId -> Bio -> VBlock -> Task ()
>       provision thinId bio vblock = do
>           ab <- allocateBlock
>           case ab of
>               NoDataSpace ->
>                   complete bio Failure
> 
>               (Allocated dblock) ->
>                   withDataLock dblock $ do
>                       metadataInsert thinId vblock dblock
>                       remapAndWait thinId bio dblock

Does the allocator block?  If so, it would be neat to pre-allocate some 
number of blocks during metadata idle times.  There could be a hidden thin 
volume (ie, devid #16777215) that blocks are allocated into and then 
stolen from for use elsewhere.  The blocks could be pre-zeroed, too!
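Roughly this shape, say (all names hypothetical):

        -- A reserve of pre-allocated (ideally pre-zeroed) blocks,
        -- topped up by an idle-time worker, so the fast path can
        -- often skip the allocator entirely.
        data Reserve = Reserve { reserveBlocks :: [DBlock] }

        takeFromReserve :: Reserve -> (Maybe DBlock, Reserve)
        takeFromReserve (Reserve [])       = (Nothing, Reserve [])
        takeFromReserve (Reserve (b : bs)) = (Just b, Reserve bs)

        -- Run while the metadata is idle: allocate and zero blocks
        -- until the reserve holds 'target' entries.
        refillReserve :: Int -> Reserve -> Task Reserve
        refillReserve target r = undefined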

>                   
>       ----------------------------------------------------------
> 
>       remapDiscard :: ThinId -> Bio -> VBlock -> Task ()
>       remapDiscard thinId bio vblock = do
>           withExclusiveLock thinId vblock $ do
>               lr <- metadataLookup thinId vblock
>               case lr of
>                   -- Discard, Unprovisioned
>                   Unprovisioned ->
>                       complete bio Success
> 
>                   -- Discard, Provisioned, !Shared
>                   (Provisioned dblock False) ->
>                       withDataLock dblock $ do
>                           remapAndWait bio dblock             -- passdown
>                           metadataRemove thinId vblock
> 
>                   -- Discard, Provisioned, Shared
>                   (Provisioned dblock True) ->
>                       withDataLock dblock $ do
>                           metadataRemove thinId vblock
>                           complete bio Success
> 
>       ----------------------------------------------------------
> 
>       remapFlush :: ThinId -> Bio -> VBlock -> Task ()
>       remapFlush _ bio _ = do
>           metadataCommit
>           complete bio Success
>           
>       ----------------------------------------------------------
> 
>       remapAndWait :: Bio -> DBlock -> Task ()
>       remapAndWait bio dblock = do
>           remap bio dblock
>           issue bio
>           wait bio
>    
> The above is a simplification (eg, discards can cover more than a single
> block, the pool has multiple modes like OUT_OF_DATA_SPACE).  But it gives
> a good idea of what the dm target needs to do, and in a succinct manner.
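
On the multi-block discard point: presumably the full version fans a
ranged discard out over the single-block path, something like this
sketch (real code must split the parent bio and complete it exactly
once, which this glosses over):

        -- A ranged discard covering several virtual blocks fans out
        -- over the per-block path above.
        remapDiscardRange :: ThinId -> Bio -> [VBlock] -> Task ()
        remapDiscardRange thinId bio = mapM_ (remapDiscard thinId bio)
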
> 
> Now dm-thin.c is anything but succinct, for a couple of reasons:
> 
> - Because of where we are in the IO stack we cannot allocate memory.
>   This means we either use memory preallocated via mempools, or allocate
>   a fixed size block before a bio is processed.
> 
> - We don't have a magic green threads library that hides the numerous
>   blocking operations that we need.  Instead we have low level facilities
>   like workqueues etc.  This tends to have the effect of breaking up the logic
>   and scattering it across lots of little completion functions.
> 
> 
> How we handle blocking, locking, and quiescing IO are all intertwined.
> Which is why switching over to the new bio_prison interface is going to
> involve an awful lot of churn.
> 
> In the upstream code
> ====================
> 
> - Locking
> 
>   The locks provided by bio_prison (v1) are all exclusive locks.  As such
>   we take pains to hold them for as short a period as possible.  This means
>   holding them for the duration of an IO is completely out of the question.
>   Nonetheless, as pointed out in the original post for this thread, this
>   can cause bad lock contention, especially if the data block size is large.
> 
> - Quiescing
> 
>   Because we do not hold locks for the lifetime of the bios, we need
>   another way of tracking IO and quiescing regions.  This is what the
>   deferred_set component does.  Effectively it divides time up into
>   bins, and keeps a reference count of how many IOs are still in flight
>   for each bin.  To quiesce we grab a lock, and then wait for all bins
>   before this lock was acquired to drain.  The advantages of this approach
>   are that it uses very little memory (I think we're currently running with
>   64 bins), and consumes v. little cpu.  But we're never specific about

Curious: is "v." short for "very" (not "versus")?

>   which region we're waiting to quiesce, instead always waiting for all
>   IO older than a certain point to drain.  So we are certainly introducing
>   more latency here.
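
To make the bin scheme concrete, my mental model is something like
this (a toy, not the kernel data structure):

        -- Time is divided into bins; each bin counts the IOs that
        -- entered while it was current.  To quiesce, snapshot the
        -- current bin index and wait for every older bin to drain.
        data DeferredSet = DeferredSet
            { dsCurrentBin :: Int
            , dsCounts     :: [Int]   -- in-flight count per bin
            }

        quiesced :: Int -> DeferredSet -> Bool
        quiesced snapshot ds = all (== 0) (take snapshot (dsCounts ds))
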
> 
> - Blocking
> 
>   A single thread services any operations that could block.
>   When there is work for this thread to perform a work queue item
>   is queued (do_worker()).  This then examines linked lists of work
>   (prepared_mappings, discard_pt1, discard_pt2, prefetches etc), and
>   processes each list as a batch.  Batching like this is a mixed blessing;
>   it allows us to sort incoming bios so we can process bios to the same
>   region at the same time, but it is also v. bad for max latency, as we
>   have no idea which piece of work was there the longest.
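
ie, roughly this trade-off (bioRegion is hypothetical):

        import Data.List (sortOn)

        -- Sorting a batch lets us handle bios to the same region
        -- together, but a bio sorted to the back may be the one
        -- that has waited the longest.
        processBatch :: Ord r => (Bio -> r) -> (Bio -> Task ()) -> [Bio] -> Task ()
        processBatch bioRegion process bios =
            mapM_ process (sortOn bioRegion bios)
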
> 
> Next iteration of the code
> ========================== 
> 
> - Locking
> 
>   bio_prison (v2) provides shared locks and custom lock levels.  So,
>   at the expense of memory, we can hold shared locks for long periods
>   that cover the lifetime of the bio.  Acquiring a lock now blocks.
> 
> - Quiescing
> 
>   Because we hold the locks for long periods we can now do away with the
>   deferred set completely.  If you want to quiesce a region, just grab
>   the exclusive lock associated with it, when it's finally granted you
>   know it's also quiesced.
> 

good to know.
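In the pseudocode's terms, quiescing collapses to a one-liner:

        -- By the time the exclusive lock is granted, every shared
        -- holder (ie, every in-flight bio) has drained.
        quiesce :: ThinId -> VBlock -> Task ()
        quiesce thinId vblock = withExclusiveLock thinId vblock (return ())
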

> - Blocking
> 
>   I want to move away from the idea of a single worker function that
>   has different categories of work stored for it in different lists.
>   Instead, we store a specific work struct on the workqueue for each bio.
>   Partly this is to reduce latency by increasing 'fairness'.  But also
>   the fact that acquiring a lock now blocks means there are a lot more
>   block operations to handle, and we'd just end up with a lot of these
>   lists of work.  It would also allow us to have multiple kernel threads
>   servicing the workqueue.
> 
>   If you look at dm-cache-target.c you'll see this has already been
>   done for that target.  We have continuation structs that represent
>   the work to be performed after the current blocking op has completed.
>   dm-cache uses this for migrations, which have a much simpler state model
>   than dm-thin.  Even so there are a lot of these little continuation
>   functions (eg, mg_start, mg_lock_writes, mg_copy, mg_full_copy,
>   mg_upgrade_lock, mg_update_metadata_after_copy, mg_update_metadata,
>   mg_success, mg_complete).
> 
> 
> Where are we now?
> =================
> 
> I did a lot of work on this a couple of years ago.  First I just dove
> in, trying to code things up by hand.  But it quickly devolved into a
> maze of badly named continuation functions, all alike.  It's very hard
> to give these functions meaningful names; go through the pseudocode at
> the top of this email and for each place where we could block, try to
> describe where we are.  The biggest problem is as we introduce more of
>   these continuations, big-picture logic recedes and it becomes v. hard to
> reason about the code.

Event-driven continuation functions seem to pop up frequently in the Linux 
kernel.  It would be neat if there were a framework for writing these 
procedurally.  Macros might help, but the result could still be pretty 
ugly; it almost needs GCC support.
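
To illustrate the fragmentation: the straight-line remapWrite above
ends up chopped into one function per blocking point, roughly like
this (acquireSharedLock and lookupAsync are made-up callback-style
stand-ins):

        acquireSharedLock :: ThinId -> VBlock -> (Lock -> IO ()) -> IO ()
        acquireSharedLock = undefined

        lookupAsync :: ThinId -> VBlock -> (LookupResult -> IO ()) -> IO ()
        lookupAsync = undefined

        remapWriteStart :: ThinId -> Bio -> VBlock -> IO ()
        remapWriteStart thinId bio vblock =
            acquireSharedLock thinId vblock (afterLock thinId bio vblock)

        afterLock :: ThinId -> Bio -> VBlock -> Lock -> IO ()
        afterLock thinId bio vblock lock =
            lookupAsync thinId vblock (afterLookup thinId bio vblock lock)

        afterLookup :: ThinId -> Bio -> VBlock -> Lock -> LookupResult -> IO ()
        afterLookup = undefined   -- ...and so on, one stage per blocking op
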
 
> I then experimented with automatically generating all the code from a
> simpler specification (I used a lispy version of the pseudocode above).
> This showed promise and I got it generating kernel code that would
> compile.  I was debugging this when I got dragged onto other work,
> and this has stagnated since.

Do you think this is the best way to proceed?  Someone with a Lisp 
background would need to help.  It might generate good first-pass code, 
but it would be difficult to maintain once kernel changes start patching 
the auto-generated code.

-Eric

> 
> 
> So that's where we are.