On 2025/05/28 22:00, 'Eric Blake' via devel wrote:
On Wed, May 28, 2025 at 08:30:05PM +0900, Akihiko Odaki wrote:
file-posix used to assume that existing holes satisfy the requested
alignment, which equals to the estimated direct I/O alignment
requirement if direct I/O is requested, and assert the assumption
unless it is at EOF.

However, the estimation of direct I/O alignment requirement is sometimes
inexact and can be overly strict. For example, I observed that QEMU
estimated the alignment requirement as 16K while the real requirement
is 4K when Btrfs is used on Linux 6.14.6 and the host page size is 16K.

For direct I/O alignment, open(2) sugguests as follows:
Since Linux 6.1, O_DIRECT support and alignment restrictions for a
file can be queried using statx(2), using the STATX_DIOALIGN flag.
Support for STATX_DIOALIGN varies by filesystem; see statx(2).

We really should be using statx() in the block/ subdirectory (even
though we aren't yet) - over time, more and more filesystems WILL
support it, and it is a more precise answer than anything else.

I found the following patch searching qemu-devel:
https://lore.kernel.org/qemu-devel/20221103183609.363027-3-stefa...@redhat.com/



Some filesystems provide their own interfaces for querying O_DIRECT
alignment restrictions, for example the XFS_IOC_DIOINFO operation in
xfsctl(3). STATX_DIOALIGN should be used instead when it is available.

If none of the above is available, then direct I/O support and
alignment restrictions can only be assumed from known characteristics
of the filesystem, the individual file, the underlying storage
device(s), and the kernel version. In Linux 2.4, most filesystems
based on block devices require that the file offset and the length and
memory address of all I/O segments be multiples of the filesystem
block size (typically 4096 bytes). In Linux 2.6.0, this was relaxed to
the logical block size of the block device (typically 512 bytes). A
block device's logical block size can be determined using the ioctl(2)
BLKSSZGET operation or from the shell using the command:

Apparently Btrfs doesn't support STATX_DIOALIGN nor provide its own
interface for querying the requirement. Using BLKSSZGET brings another
problem to determine the underlying block device, which also involves
heuristics.

Moreover, even if we could figure out the direct I/O alignment
requirement, I could not find a documentation saying it will exactly
matche with the alignment of holes.

s/matche/match/

I'll fix it with the next version.



So stop asserting the assumption on the holes and tolerate them being
unaligned.

Tolerating unaligned holes is wise, even if we can improve our probing
to be more accurate in other patches.  That said...


Signed-off-by: Akihiko Odaki <akihiko.od...@daynix.com>
---
  block/file-posix.c | 19 +++++++++----------
  1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index ec95b748696b..7b686ce6817d 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -3307,22 +3307,21 @@ static int coroutine_fn 
raw_co_block_status(BlockDriverState *bs,
          *pnum = bytes;
          ret = BDRV_BLOCK_DATA;
      } else if (data == offset) {
-        /* On a data extent, compute bytes to the end of the extent,
-         * possibly including a partial sector at EOF. */
+        /* On a data extent, compute bytes to the end of the extent. */
          *pnum = hole - offset;
/*
+         * We may have allocation unaligned with the requested alignment due to
+         * the following reaons:
+         * - unaligned file size
+         * - inexact direct I/O alignment requirement estimation
+         * - mismatches between the allocation size and
+         *   direct I/O alignment requirement.
+         *
           * We are not allowed to return partial sectors, though, so
           * round up if necessary.
           */
-        if (!QEMU_IS_ALIGNED(*pnum, bs->bl.request_alignment)) {
-            int64_t file_length = raw_getlength(bs);
-            if (file_length > 0) {
-                /* Ignore errors, this is just a safeguard */
-                assert(hole == file_length);
-            }
-            *pnum = ROUND_UP(*pnum, bs->bl.request_alignment);
-        }
+        *pnum = ROUND_UP(*pnum, bs->bl.request_alignment);
ret = BDRV_BLOCK_DATA;

...rounding data extent sizes up to the alignment that the rest of the
block layer sees is always safe.  But I would expect some symmetry -
anywhere we are rounding up to report data instead of hole, there
either needs to be counterpart code on the holes or else a good reason
why the holes don't need matching code, where unaligned holes are
rounded down to alignment boundaries (and if that rounds down to 0,
report data instead).  That way, you can't get different answers based
on where in the sector you are asking the question.  We do know that
the block layer is supposed to only be asking the question at the
start of an alignment boundary (even when our alignment boundaries are
too large, such as your mention of 16k alignment when the filesystem
supports 4k holes).  At best, it may just be a matter of adding
comments to document why we are safe, but I'm not sure that is
sufficient for this patch.

Restated visually, if we have | at 16k boundaries (what qemu picked as
the dio alignment), + at 4k boundaries (the granularity of holes that
the fs supports), and the following file structure with Data and Holes
marked:

|+++|+++|+++|+++|+++|
DDDDDHHHHDDHDDDDHHHH

Then the claim is that the block layer will only ever ask for status
at the | points (and not at the +), and the results that it should see
after rounding are as follows (where lowercase respresents changed
results due to rounding to alignment):

|+++|+++|+++|+++|+++|
DDDDDHHHHDDHDDDDHHHH
DDDDDddddDDdDDDDHHHH
         ^

But the important question I don't see in your patch is whether you
handle a block_status query at the point of ^ correctly (it lands in a
hole, but since there is data also present in that alignment boundary,
you have to report the entire 16k as data).

Thank you for a detailed explanation. I will ensure that offset + *pnum will be aligned with bs->bl.request_alignment with the next version, which I think semantically makes sense.

Reply via email to