On Wed, Sep 25, 2024 at 12:45 PM Thomas Munro <thomas.mu...@gmail.com> wrote:
> On Wed, Sep 25, 2024 at 8:30 AM Andres Freund <and...@anarazel.de> wrote:
> > However, our habit of modifying buffers while IO is going on is
> > causing issues with filesystem level checksums as well, as evidenced by the
> > fact that debug_io_direct = data on btrfs causes filesystem corruption. So I
> > tend to think it'd be better to just stop doing that altogether (we also do
> > that for WAL, when writing out a partial page, but a potential fix there
> > would be different, I think).
>
> +many.  Interesting point re the WAL variant.  For the record, here's
> some discussion and a repro for that problem, which Andrew currently
> works around in a build farm animal with mount options:
>
> https://www.postgresql.org/message-id/CA%2BhUKGKSBaz78Fw3WTF3Q8ArqKCz1GgsTfRFiDPbu-j9OFz-jw%40mail.gmail.com

Here's an interesting new development in that area, this time from
OpenZFS, which committed its long-awaited O_DIRECT support a couple of
weeks ago[1] and seems to have taken a different direction since that
last discussion.  Clearly it has the same checksum stability problem as
btrfs and PostgreSQL itself, so an O_DIRECT mode with the goal of
avoiding copying and caching must confront that and break *something*,
or accept something like bounce buffers and give up the zero-copy goal.

Curiously, they seem to have landed on two different solutions with
three different possible behaviours between them: (1) on FreeBSD,
temporarily make the memory non-writeable, and (2) on Linux, where they
couldn't do that, add an extra checksum verification on write, which
behaves in one of two ways depending on zfs_vdev_direct_write_verify.

I haven't fully grokked all this yet, or even tried it, and it's not
released or anything, but it looks a bit like all three behaviours are
bad for our current hint bit design.  On FreeBSD, setting a hint bit
might crash (?) if a write is in progress in another process.  On Linux,
depending on zfs_vdev_direct_write_verify, either the concurrent write
might fail (= checkpointer failing on EIO because someone concurrently
set a hint bit) or a later read might fail (= file is permanently
corrupted and you don't find out until later, like btrfs).

I plan to look more closely soon and see if I understood that right...

[1] https://github.com/openzfs/zfs/pull/10018/commits/d7b861e7cfaea867ae28ab46ab11fba89a5a1fda
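
For anyone who wants the checksum stability problem spelled out, here is
a toy sketch in Python.  It is not PostgreSQL or OpenZFS code: CRC32
stands in for whatever checksum the filesystem really uses, and
PAGE_SIZE, HINT_BYTE_OFFSET, HINT_BIT and both function names are made up
for illustration.  It only models the ordering that I believe causes the
trouble: the checksum is taken from the live buffer, a hint bit is set,
and then the data is read from the same buffer without a private copy.

```python
import zlib

PAGE_SIZE = 8192          # assumed page size, for illustration only
HINT_BYTE_OFFSET = 100    # hypothetical byte that holds a hint bit
HINT_BIT = 0x01           # hypothetical hint bit mask


def simulate_direct_write(page: bytearray, set_hint_during_write: bool):
    """Model a zero-copy write; return (stored_bytes, stored_checksum)."""
    # Step 1: the filesystem checksums the caller's buffer in place.
    stored_checksum = zlib.crc32(page)

    # Step 2 (the race): another process sets a hint bit while the
    # write is still in flight.
    if set_hint_during_write:
        page[HINT_BYTE_OFFSET] |= HINT_BIT

    # Step 3: the data is transferred from that same buffer, so the
    # bytes that reach storage may differ from the bytes checksummed.
    stored_bytes = bytes(page)
    return stored_bytes, stored_checksum


def verifies(stored_bytes: bytes, stored_checksum: int) -> bool:
    """Model a verification pass, at write time or on a later read."""
    return zlib.crc32(stored_bytes) == stored_checksum


if __name__ == "__main__":
    quiet = simulate_direct_write(bytearray(PAGE_SIZE), False)
    raced = simulate_direct_write(bytearray(PAGE_SIZE), True)
    print("no concurrent hint bit, verifies:", verifies(*quiet))
    print("concurrent hint bit, verifies:   ", verifies(*raced))
```

In this model, whether the mismatch shows up as a failed write or as a
failed read later depends only on when verifies() is called, which is
roughly the distinction between the two Linux behaviours as I understand
them so far.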