On Thu, May 26, 2022 at 04:00:08PM -0500, Eric Blake wrote:
> [We are still investigating if a CVE needs to be assigned.]
>
> We have a bug where the blocksize filter can lose the effects of an
> aligned write that happens in parallel with an overlapping unaligned
> write.  In general, a client should be extremely cautious about
> overlapping writes; it is not NBD's fault which of the two writes
> lands on the disk (or even if both writes land partially, at
> block-level granularities, which write landed last).  However, a
> client should still be able to assume that even when two writes
> partially overlap, the final state of the non-overlapping portion of
> the disk will be that of the one write explicitly touching that
> portion, rather than stale data from before either write touched that
> block.
>
> Fix the bug by switching the blocksize filter locking from a mutex to
> a rwlock.  We still need mutual exclusion during all unaligned
> operations (whether reading or writing to the plugin), because we
> share the same bounce buffer among all such code, which we get via the
> mutual exclusion of wrlock.  But we now also add in shared exclusion
> for aligned write-like operations (pwrite, trim, zero); these can
> operate in parallel with one another, but must not be allowed to fall
> in the window between when an overlapping unaligned operation has read
> from the plugin but not yet written, so that we do not end up
> re-writing stale contents that undo the portion of the aligned write
> that was outside the unaligned region [it's ironic that we only need
> this protection on write I/O, but get it by using the rdlock feature
> of the rwlock].
> Read-like operations (pread, extents, cache) don't
> need to be protected from RMW operations, because it is already the
> user's problem if they execute parallel overlapping reads while
> modifying the same portion of the image; and they will still end up
> seeing either the state before or after the unaligned write,
> indistinguishable from if we had added more locking.
>
> Whether this lock favors rdlock (aka aligned pwrite) or wrlock (aka
> unaligned access) is implementation-defined by the pthread
> implementation; if we need to be more precise, we'll have to introduce
> our own rwlock implementation on top of lower-level primitives
> (https://en.wikipedia.org/wiki/Readers%E2%80%93writer_lock mentions
> the use of two mutexes to favor readers, or a condvar/mutex to favor
> writers).  This is also not very fine-grained; it may turn out that we
> could get more performance if instead of locking the entire operation
> (across all clients, and regardless of whether the offsets overlap),
> we instead just lock metadata and then track whether a new operation
> overlaps with an existing unaligned operation, or even add in support
> for parallel unaligned operations by having more than one bounce
> buffer.  But if performance truly matters, you're better off fixing
> the client to do aligned access in the first place, rather than
> needing the blocksize filter.
>
> Add a test that fails without the change to blocksize.c.  With some
> carefully timed setups (using the delay filter to stall writes
> reaching the plugin, and the plugin itself to stall reads coming back
> from the plugin, so that we are reasonably sure of the overall
> timeline), we can demonstrate the bug present in the unpatched code,
> where an aligned write is lost when it lands in the wrong part of an
> unaligned RMW cycle; the fixed code instead shows that the overlapping
> operations were serialized.
> We may need to further tweak the test to
> be more reliable even under heavy load, but hopefully 2 seconds per
> stall (where a successful test requires 18 seconds) is sufficient for
> now.
>
> Reported-by: Nikolaus Rath <nikol...@rath.org>
> Fixes: 1aadbf9971 ("blocksize: Allow parallel requests", v1.25.3)
> ---
>
> In v2:
>  - Implement the blocksize fix.
>  - Drop RFC; I think this is ready, other than determining if we want
>    to tag it with a CVE number.
>  - Improve the test: print out more details before assertions, to aid
>    in debugging if it ever dies under CI resources
>
>  tests/Makefile.am                |   2 +
>  filters/blocksize/blocksize.c    |  26 +++--
>  tests/test-blocksize-sharding.sh | 168 +++++++++++++++++++++++++++++++
>  3 files changed, 187 insertions(+), 9 deletions(-)
>  create mode 100755 tests/test-blocksize-sharding.sh
>
> diff --git a/tests/Makefile.am b/tests/Makefile.am
> index b6d30c0a..67e7618b 100644
> --- a/tests/Makefile.am
> +++ b/tests/Makefile.am
> @@ -1376,11 +1376,13 @@ TESTS += \
>  	test-blocksize.sh \
>  	test-blocksize-extents.sh \
>  	test-blocksize-default.sh \
> +	test-blocksize-sharding.sh \
>  	$(NULL)
>  EXTRA_DIST += \
>  	test-blocksize.sh \
>  	test-blocksize-extents.sh \
>  	test-blocksize-default.sh \
> +	test-blocksize-sharding.sh \
>  	$(NULL)
>
>  # blocksize-policy filter test.
> diff --git a/filters/blocksize/blocksize.c b/filters/blocksize/blocksize.c
> index 03da4971..1c81d5e3 100644
> --- a/filters/blocksize/blocksize.c
> +++ b/filters/blocksize/blocksize.c
> @@ -51,10 +51,15 @@
>
>  #define BLOCKSIZE_MIN_LIMIT (64U * 1024)
>
> -/* In order to handle parallel requests safely, this lock must be held
> - * when using the bounce buffer.
> +/* Lock in order to handle overlapping requests safely.
> + *
> + * Grabbed for exclusive access (wrlock) when using the bounce buffer.
> + *
> + * Grabbed for shared access (rdlock) when doing aligned writes.
> + * These can happen in parallel with one another, but must not land in
> + * between the read and write of an unaligned RMW operation.
>   */
> -static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
> +static pthread_rwlock_t lock = PTHREAD_RWLOCK_INITIALIZER;
>
>  /* A single bounce buffer for alignment purposes, guarded by the lock.
>   * Size it to the maximum we allow for minblock.
> @@ -255,7 +260,7 @@ blocksize_pread (nbdkit_next *next,
>
>    /* Unaligned head */
>    if (offs & (h->minblock - 1)) {
> -    ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
> +    ACQUIRE_WRLOCK_FOR_CURRENT_SCOPE (&lock);
>      drop = offs & (h->minblock - 1);
>      keep = MIN (h->minblock - drop, count);
>      if (next->pread (next, bounce, h->minblock, offs - drop, flags, err) == -1)
>        return -1;
> @@ -278,7 +283,7 @@ blocksize_pread (nbdkit_next *next,
>
>    /* Unaligned tail */
>    if (count) {
> -    ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
> +    ACQUIRE_WRLOCK_FOR_CURRENT_SCOPE (&lock);
>      if (next->pread (next, bounce, h->minblock, offs, flags, err) == -1)
>        return -1;
>      memcpy (buf, bounce, count);
> @@ -306,7 +311,7 @@ blocksize_pwrite (nbdkit_next *next,
>
>    /* Unaligned head */
>    if (offs & (h->minblock - 1)) {
> -    ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
> +    ACQUIRE_WRLOCK_FOR_CURRENT_SCOPE (&lock);
>      drop = offs & (h->minblock - 1);
>      keep = MIN (h->minblock - drop, count);
>      if (next->pread (next, bounce, h->minblock, offs - drop, 0, err) == -1)
> @@ -321,6 +326,7 @@ blocksize_pwrite (nbdkit_next *next,
>
>    /* Aligned body */
>    while (count >= h->minblock) {
> +    ACQUIRE_RDLOCK_FOR_CURRENT_SCOPE (&lock);
>      keep = MIN (h->maxdata, ROUND_DOWN (count, h->minblock));
>      if (next->pwrite (next, buf, keep, offs, flags, err) == -1)
>        return -1;
> @@ -331,7 +337,7 @@ blocksize_pwrite (nbdkit_next *next,
>
>    /* Unaligned tail */
>    if (count) {
> -    ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
> +    ACQUIRE_WRLOCK_FOR_CURRENT_SCOPE (&lock);
>      if (next->pread (next, bounce, h->minblock, offs, 0, err) == -1)
>        return -1;
>      memcpy (bounce, buf, count);
> @@ -371,6 +377,7 @@ blocksize_trim (nbdkit_next *next,
>
>    /* Aligned body */
>    while (count) {
> +    ACQUIRE_RDLOCK_FOR_CURRENT_SCOPE (&lock);
>      keep = MIN (h->maxlen, count);
>      if (next->trim (next, keep, offs, flags, err) == -1)
>        return -1;
> @@ -413,7 +420,7 @@ blocksize_zero (nbdkit_next *next,
>
>    /* Unaligned head */
>    if (offs & (h->minblock - 1)) {
> -    ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
> +    ACQUIRE_WRLOCK_FOR_CURRENT_SCOPE (&lock);
>      drop = offs & (h->minblock - 1);
>      keep = MIN (h->minblock - drop, count);
>      if (next->pread (next, bounce, h->minblock, offs - drop, 0, err) == -1)
> @@ -428,6 +435,7 @@ blocksize_zero (nbdkit_next *next,
>
>    /* Aligned body */
>    while (count >= h->minblock) {
> +    ACQUIRE_RDLOCK_FOR_CURRENT_SCOPE (&lock);
>      keep = MIN (h->maxlen, ROUND_DOWN (count, h->minblock));
>      if (next->zero (next, keep, offs, flags, err) == -1)
>        return -1;
> @@ -437,7 +445,7 @@ blocksize_zero (nbdkit_next *next,
>
>    /* Unaligned tail */
>    if (count) {
> -    ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
> +    ACQUIRE_WRLOCK_FOR_CURRENT_SCOPE (&lock);
>      if (next->pread (next, bounce, h->minblock, offs, 0, err) == -1)
>        return -1;
>      memset (bounce, 0, count);
>
> diff --git a/tests/test-blocksize-sharding.sh b/tests/test-blocksize-sharding.sh
> new file mode 100755
> index 00000000..177668ed
> --- /dev/null
> +++ b/tests/test-blocksize-sharding.sh
> @@ -0,0 +1,168 @@
> +#!/usr/bin/env bash
> +# nbdkit
> +# Copyright (C) 2018-2022 Red Hat Inc.
> +#
> +# Redistribution and use in source and binary forms, with or without
> +# modification, are permitted provided that the following conditions are
> +# met:
> +#
> +# * Redistributions of source code must retain the above copyright
> +#   notice, this list of conditions and the following disclaimer.
> +#
> +# * Redistributions in binary form must reproduce the above copyright
> +#   notice, this list of conditions and the following disclaimer in the
> +#   documentation and/or other materials provided with the distribution.
> +#
> +# * Neither the name of Red Hat nor the names of its contributors may be
> +#   used to endorse or promote products derived from this software without
> +#   specific prior written permission.
> +#
> +# THIS SOFTWARE IS PROVIDED BY RED HAT AND CONTRIBUTORS ''AS IS'' AND
> +# ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
> +# THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
> +# PARTICULAR PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL RED HAT OR
> +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> +# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> +# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
> +# USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
> +# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
> +# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
> +# OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
> +# SUCH DAMAGE.
> +
> +# Demonstrate a fix for a bug where blocksize could lose aligned writes
> +# run in parallel with unaligned writes
> +
> +source ./functions.sh
> +set -e
> +set -x
> +
> +requires_plugin eval
> +requires_nbdsh_uri
> +
> +files='blocksize-sharding.img blocksize-sharding.tmp'
> +rm -f $files
> +cleanup_fn rm -f $files
> +
> +# Script a server that requires 16-byte aligned requests, and which delays
> +# 4s after reads if a witness file exists.  Couple it with the delay filter
> +# that delays 2s before writes.  If an unaligned and aligned write overlap,
> +# and can execute in parallel, we would have this timeline:
> +#
> +#      T1 aligned write 1's to 0/16     T2 unaligned write 2's to 4/12
> +#t=0   blocksize: next->pwrite(0, 16)   blocksize: next->pread(0, 16)
> +#        delay: wait 2s                   delay: next->pread(0, 16)
> +#        ...                               eval: read 0's, wait 4s
> +#t=2     delay: next->pwrite(0, 16)       ...
> +#          eval: write 1's                ...
> +#        return                           ...
> +#t=4                                     return 0's (now stale)
> +#                                        blocksize: next->pwrite(0, 16)
> +#                                          delay: wait 2s
> +#t=6                                       delay: next->pwrite(0, 16)
> +#                                            eval: write stale RMW buffer
> +#
> +# leaving us with a sharded 0000222222222222 (T1's write is lost).
> +# But as long as the blocksize filter detects the overlap, we should end
> +# up with either 1111222222222222 (aligned write completed first), or with
> +# 1111111111111111 (unaligned write completed first), either taking 8s,
> +# but with no sharding.
> +#
> +# We also need an nbdsh script that kicks off parallel writes.
> +export script='
> +import os
> +import time
> +
> +witness = os.getenv("witness")
> +
> +def touch(path):
> +    open(path, "a").close()
> +
> +# First pass: check that two aligned operations work in parallel
> +# Total time should be closer to 2 seconds, rather than 4 if serialized
> +print("sanity check")
> +ba1 = bytearray(b"1"*16)
> +ba2 = bytearray(b"2"*16)
> +buf1 = nbd.Buffer.from_bytearray(ba1)
> +buf2 = nbd.Buffer.from_bytearray(ba2)
> +touch(witness)
> +start_t = time.time()
> +h.aio_pwrite(buf1, 0)
> +h.aio_pwrite(buf2, 0)
> +
> +while h.aio_in_flight() > 0:
> +    h.poll(-1)
> +end_t = time.time()
> +os.unlink(witness)
> +
> +out = h.pread(16,0)
> +print(out)
> +t = end_t - start_t
> +print(t)
> +assert out in [b"1"*16, b"2"*16]
> +assert t >= 2 and t <= 3
> +
> +# Next pass: try to kick off unaligned first
> +print("unaligned first")
> +h.zero(16, 0)
> +ba3 = bytearray(b"3"*12)
> +ba4 = bytearray(b"4"*16)
> +buf3 = nbd.Buffer.from_bytearray(ba3)
> +buf4 = nbd.Buffer.from_bytearray(ba4)
> +touch(witness)
> +start_t = time.time()
> +h.aio_pwrite(buf3, 4)
> +h.aio_pwrite(buf4, 0)
> +
> +while h.aio_in_flight() > 0:
> +    h.poll(-1)
> +end_t = time.time()
> +os.unlink(witness)
> +
> +out = h.pread(16,0)
> +print(out)
> +t = end_t - start_t
> +print(t)
> +assert out in [b"4"*4 + b"3"*12, b"4"*16]
> +assert t >= 8
> +
> +# Next pass: try to kick off aligned first
> +print("aligned first")
> +ba5 = bytearray(b"5"*16)
> +ba6 = bytearray(b"6"*12)
> +buf5 = nbd.Buffer.from_bytearray(ba5)
> +buf6 = nbd.Buffer.from_bytearray(ba6)
> +h.zero(16, 0)
> +touch(witness)
> +start_t = time.time()
> +h.aio_pwrite(buf5, 0)
> +h.aio_pwrite(buf6, 4)
> +
> +while h.aio_in_flight() > 0:
> +    h.poll(-1)
> +end_t = time.time()
> +os.unlink(witness)
> +
> +out = h.pread(16,0)
> +print(out)
> +t = end_t - start_t
> +print(t)
> +assert out in [b"5"*4 + b"6"*12, b"5"*16]
> +assert t >= 8
> +'
> +
> +# Now run everything
> +truncate --size=16 blocksize-sharding.img
> +export witness="$PWD/blocksize-sharding.tmp"
> +nbdkit -U - --filter=blocksize --filter=delay eval delay-write=2 \
> +    config='ln -sf "$(realpath "$3")" $tmpdir/$2' \
> +    img="$PWD/blocksize-sharding.img" tmp="$PWD/blocksize-sharding.tmp" \
> +    get_size='echo 16' block_size='echo 16 64K 1M' \
> +    thread_model='echo parallel' \
> +    zero='dd if=/dev/zero of=$tmpdir/img skip=$4 count=$3 \
> +      iflag=count_bytes,skip_bytes' \
> +    pread='
> +      dd if=$tmpdir/img skip=$4 count=$3 iflag=count_bytes,skip_bytes
> +      if [ -f $tmpdir/tmp ]; then sleep 4; fi ' \
> +    pwrite='dd of=$tmpdir/img seek=$4 conv=notrunc oflag=seek_bytes' \
> +    --run 'nbdsh -u "$uri" -c "$script"'
Reviewed-by: Richard W.M. Jones <rjo...@redhat.com>

I have no preference on whether or not to put the test into a separate
commit.

I'm not sure I understand where the potential CVE is.  In what scenario
could the old filter be exploited?

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat
http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-top is 'top' for virtual machines.  Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top

_______________________________________________
Libguestfs mailing list
Libguestfs@redhat.com
https://listman.redhat.com/mailman/listinfo/libguestfs