On May 25 2022, Eric Blake <ebl...@redhat.com> wrote: > On Tue, May 24, 2022 at 08:45:02PM +0100, Nikolaus Rath wrote: >> > However, you are worried that a third possibility occurs: >> > >> > T2 sees that it needs to do RMW, grabs the lock, and reads 0x00-0x0f >> > for the unaligned head (it only needs 0x00-0x03, but we have to read a >> > block at a time), to populate its buffer with IIIIBBBBBBBBBBBB. >> > >> > T1 now writes 0x00-0x0f with AAAAAAAAAAAAAAAA, without any lock >> > blocking it. >> > >> > T2 now writes 0x00-0x0f using the contents of its buffer, resulting in: >> > >> > 0 0 0 0 1 1 1 1 >> > 0...4...8...a...0...4...8...a... >> > end3: IIIIBBBBBBBBBBBBBBBBBBBBBBIIIIII >> > >> > which does NOT reflect either of the possibilities where T1 and T2 >> > write atomically. Basically, we have become the victim of sharding. >> >> Yes, this is the scenario that I am worried about. >> >> I this is a data corruption problem no matter if we assume that writes >> should be atomically or not. > > Writes to a single block should be atomic. Writes larger than a block > need not be atomic. But when we advertise a particular block size, > clients should be safe in assuming that anything smaller than that > block size is inadvisable (either it will fail as too small, or it > will incur a RMW penalty), but that something at the block size should > not need client-side protection when no other parallel requests are > overlapping that block. And that is where our blocksize filter is > currently failing.
Agreed. >> > You have just demonstrated that our current approach (grabbing a >> > single semaphore, only around the unaligned portions), does not do >> > what we hoped. So what WOULD protect us, while still allowing as much >> > parallelism as possible? >> >> How about a per-block lock as implemented for the S3 plugin in >> https://gitlab.com/nbdkit/nbdkit/-/merge_requests/10? > > That feels pretty heavyweight. It may allow more parallelism, but at > the expense of more resources. I'm hoping that a single rwlock will > do, although it may be dominated by starvation time when toggling > between unaligned actions (serialized) vs aligned actions (parallel). The resource usage is a single shared variable with an underlying mutex, and a set (probably best implemented as a hash) containing at most as many elements as block are being written in concurrently. This does not sound like much to me. Am I missing something? Best, -Nikolaus -- GPG Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F »Time flies like an arrow, fruit flies like a Banana.« _______________________________________________ Libguestfs mailing list Libguestfs@redhat.com https://listman.redhat.com/mailman/listinfo/libguestfs