On Tue, May 24, 2022 at 09:01:28AM -0500, Eric Blake wrote:
> On Sat, May 21, 2022 at 01:21:11PM +0100, Nikolaus Rath wrote:
> > Hi,
> >
> > How does the blocksize filter take into account writes that end up
> > overlapping due to read-modify-write cycles?
> >
> > Specifically, suppose there are two non-overlapping writes handled
> > by two different threads that, due to blocksize requirements,
> > overlap when expanded.  I think there is a risk that one thread may
> > partially undo the work of the other here.
> >
> > Looking at the code, it seems that writes of unaligned heads and
> > tails are protected with a global lock, but writes of aligned data
> > can occur concurrently.
> >
> > However, does this not miss the case where there is one unaligned
> > write that overlaps with an aligned one?
> >
> > For example, with blocksize 10, we could have:
> 
> The blocksize filter requires blocks to be sized as a power of 2,
> which 10 is not.  I will try to restate your question using hex
> boundaries; please correct me if I'm mis-stating things.
> 
> > Thread 1: receives write request for offset=0, size=10
> > Thread 2: receives write request for offset=4, size=16
> 
> minblock = 0x10
> Thread 1: receives write request for offset 0x00, size 0x10 (aligned
> request)
> Thread 2: receives write request for offset 0x04, size 0x16 (unaligned
> offset, unaligned size)
> 
> Graphically, we are wanting to write the following, given initial
> disk contents of I:
> 
>         0   0   0   0   1   1   1   1
>         0...4...8...a...0...4...8...a...
> start   IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
> T1:     AAAAAAAAAAAAAAAA
> T2:         BBBBBBBBBBBBBBBBBBBBBB
> 
> Because both writes are issued simultaneously, we do not know whether
> bytes 0x04 through 0x0f will be written as A or B.  But our
> assumption is that because blocks are written atomically, we hope to
> get exactly one of the two following results, where either T1 beat
> T2:
The NBD protocol doesn't mention "atomic" anywhere.  Are writes
actually guaranteed to be atomic like this?  I'd be a bit more
convinced if this could be shown to happen in a non-overlapping case.

(It may be that we still want to fix the blocksize filter as you've
described -- using a rwlock seems like a reasonable and non-invasive
fix -- but I'd like to understand what is guaranteed and what is a
client's problem.)

>         0   0   0   0   1   1   1   1
>         0...4...8...a...0...4...8...a...
> end1:   AAAABBBBBBBBBBBBBBBBBBBBBBIIIIII
> 
> or where T2 beat T1:
> 
>         0   0   0   0   1   1   1   1
>         0...4...8...a...0...4...8...a...
> end2:   AAAAAAAAAAAAAAAABBBBBBBBBBIIIIII
> 
> However, you are worried that a third possibility occurs:
> 
> > Thread 1: acquires lock, reads bytes 0-4
> > Thread 2: does aligned write (no locking needed), writes bytes 0-10
> > Thread 1: writes bytes 0-10, overwriting data from Thread 2
> 
> These do not match your initial conditions above (why is thread 1
> reading; why is thread 2 doing aligned action).  Again, I'm rewriting
> this into something that I think matches what you intended to ask,
> but I welcome corrections:
> 
> T2 sees that it needs to do RMW, grabs the lock, and reads 0x00-0x0f
> for the unaligned head (it only needs 0x00-0x03, but we have to read
> a block at a time), to populate its buffer with IIIIBBBBBBBBBBBB.
> 
> T1 now writes 0x00-0x0f with AAAAAAAAAAAAAAAA, without any lock
> blocking it.
> 
> T2 now writes 0x00-0x0f using the contents of its buffer, resulting
> in:
> 
>         0   0   0   0   1   1   1   1
>         0...4...8...a...0...4...8...a...
> end3:   IIIIBBBBBBBBBBBBBBBBBBBBBBIIIIII
> 
> which does NOT reflect either of the possibilities where T1 and T2
> write atomically.  Basically, we have become the victim of sharding.
> 
> You are correct that it is annoying that this third possibility
> (where T1 appears to have never run) is possible with the blocksize
> filter.  And we should probably consider it as a data-corruption bug.
> Your blocksize example of 10 (or 0x10) bytes is unlikely, but we are
> more likely to hit scenarios where an older guest assumes it is
> writing to 512-byte aligned hardware, while using the blocksize
> filter to try and guarantee RMW atomic access to 4k modern hardware.
> The older client will be unaware that it must avoid parallel writes
> that are 512-aligned but land in the same 4k page, so it seems like
> the blocksize filter should be doing that.
> 
> You have just demonstrated that our current approach (grabbing a
> single semaphore, only around the unaligned portions) does not do
> what we hoped.  So what WOULD protect us, while still allowing as
> much parallelism as possible?  It sounds like we want a RW-lock
> (note: not in the sense of parallel pread and exclusive pwrite
> operations, but rather in the sense of parallel aligned operations
> taking a rdlock whether it is a pread or pwrite operation, and
> exclusive unaligned operations taking a wrlock whether pread or
> pwrite).
> 
> pthread_rwlock_t would work, although it does not have guarantees on
> whether the implementation will favor readers or writers.  It is
> also possible to implement our own using 2 mutexes (favors readers),
> or a condvar and mutex (easy to favor writers) [1].  Favoring
> readers can starve writers indefinitely but gets the most
> concurrency; favoring writers avoids starvation but has less
> performance.  Other policies are RCU (wait-free for rdlock) or
> seqlock, borrowing ideas from the Linux kernel.  Maybe we should
> make the lock policy a configurable knob, based on how likely the
> client is to do unaligned operations that require a write lock?
> 
> [1] https://en.wikipedia.org/wiki/Readers%E2%80%93writer_lock

Rich.
-- 
Richard Jones, Virtualization Group, Red Hat
http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html

_______________________________________________
Libguestfs mailing list
Libguestfs@redhat.com
https://listman.redhat.com/mailman/listinfo/libguestfs