On Thu, Nov 29, 2018 at 01:00:25PM -0700, Jens Axboe wrote:
> sbitmap maintains a set of words that we use to set and clear bits, with
> each bit representing a tag for blk-mq. Even though we spread the bits
> out and maintain a hint cache, one particular bit allocated will end up
> being cleared in the exact same spot.
> 
> This introduces batched clearing of bits. Instead of clearing a given
> bit, the same bit is set in a cleared/free mask instead. If we fail
> allocating a bit from a given word, then we check the free mask, and
> batch move those cleared bits at that time. This trades 64 atomic bitops
> for 2 cmpxchg(). On top of that, we do those sequentially, hopefully
> making that a bit cheaper as well.
> 
> In a threaded poll test case, half the overhead of getting and clearing
> tags is removed with this change. On another poll test case with a
> single thread, performance is unchanged.
>     
> Signed-off-by: Jens Axboe <ax...@kernel.dk>
> 
> ---
> 
> This patch is on top of the round robin fix for sbitmap just posted.

So I think this can lead to hangs due to interactions with wake_batch.
Bear with me:

Let's say the sbitmap_queue was created with depth=64 and shift=6, i.e.,
64 bits all in one word. In this case, wake_batch=8, because 64 bits
clearing should be enough to wake up in 8 batches of 8.

Let's say all 64 bits are allocated and there are 8 waiters. All 64 bits
are then freed (in the cleared word), so we wake up the 8 waiters. Let's
say they all attempt __sbitmap_get_word(), fail, and get to
sbitmap_deferred_clear() at the same time. One of them does the
cmpxchg() on cleared, setting it to zero, and then the rest see that
cleared is zero, so they return because there don't appear to be any
cleared bits. The first thread succeeds, and the rest go to sleep. 

Now, when the first thread clears the bit, it only subtracts one from
the batch, which isn't enough to do a wakeup. Hang!

This example is contrived, but in general I think that the window
between the cleared mask being zeroed and the word being cleared is
going to cause problems.

> diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
> index 804a50983ec5..cec685b89998 100644
> --- a/include/linux/sbitmap.h
> +++ b/include/linux/sbitmap.h
> @@ -30,14 +30,19 @@ struct seq_file;
>   */
>  struct sbitmap_word {
>       /**
> -      * @word: The bitmap word itself.
> +      * @depth: Number of bits being used in @word/@cleared
>        */
> -     unsigned long word;
> +     unsigned long depth;
>  
>       /**
> -      * @depth: Number of bits being used in @word.
> +      * @word: word holding free bits
>        */
> -     unsigned long depth;
> +     unsigned long word ____cacheline_aligned_in_smp;

Completely separate note (assuming that we can iron out the race above),
this triples the size of each map. Does the word have to be in a
separate cacheline from the depth? Ignoring resize, depth and word are
always accessed together, so it seems to me that it would benefit from
sharing a cacheline now that clearing happens elsewhere.

> +     /**
> +      * @cleared: word holding cleared bits
> +      */
> +     unsigned long cleared ____cacheline_aligned_in_smp;
>  } ____cacheline_aligned_in_smp;

Reply via email to