On Mon, Jun 22, 2026 at 4:13 AM David Laight <[email protected]> wrote: >
Hi David,

Thank you for your review. You raised many good points regarding
optimizations here.

I'll switch to using 2G as the max entry size (`SZ_2G` from
`linux/sizes.h`) and remove the divisions and multiplications. I'll also
replace the `for()` loop with `while (length)`, and drop `min_t()` in
favor of `min()` by casting `SZ_2G` to `size_t`.

I'll send out a v2 with these changes shortly.

Thanks,
David

> > Currently, `fill_sg_entry()` splits the scatterlist using `UINT_MAX`.
> > This creates a non-page-aligned DMA length (`0xFFFFFFFF`) for the
> > first entry, resulting in non-page-aligned DMA addresses for all
> > subsequent entries.
>
> How did you find this?
> It requires a single buffer over 4GB - seems highly unlikely.

It was observed during experiments with buffers over 8GB on an
accelerator.

> >
> > While the underlying IOMMU mapping may be contiguous, hardware
> > DMA engines often require explicit address alignment (e.g., page,
> > cacheline, or storage sector boundaries). Passing unaligned
> > addresses and lengths can cause explicit failures in DMA descriptor
> > creation or silent data corruption if lower unaligned bits are
> > truncated.
> >
> > Fix this by splitting the scatterlist by the largest possible page
> > aligned chunk within `UINT_MAX` (`ALIGN_DOWN(UINT_MAX, PAGE_SIZE)`).
> > This ensures all scatterlist DMA addresses and lengths remain page
> > aligned and satisfy hardware constraints.
>
> It would almost certainly be better to split into 2G chunks.
> That removes any need for any divisions.

I agree. 2G naturally aligns with most hardware boundaries, and because
it is a power of two the compiler can reduce any remaining arithmetic to
simple bit shifts.

> >
> > Page-aligned entries allow the system to cleanly chunk payloads into
> > PCIe MaxPayloadSize (MPS) (e.g., 128 bytes, 256 bytes, 512 bytes).
> > As a result, this may help reduce TLP fragmentation in P2P transfers
> > and alleviate potential congestion within a logical PCIe switch
> > partition, especially when Relaxed Ordering is not possible due to
> > hardware constraints.
> >
> > Reported-by: sashiko-bot <[email protected]>
> > Closes:
> > https://lore.kernel.org/all/[email protected]/
> > Fixes: 3aa31a8bb11e ("dma-buf: provide phys_vec to scatter-gather mapping
> > routine")
> > Cc: [email protected]
> > Signed-off-by: David Hu <[email protected]>
> > ---
> >  drivers/dma-buf/dma-buf-mapping.c | 13 ++++++++-----
> >  1 file changed, 8 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/dma-buf/dma-buf-mapping.c
> > b/drivers/dma-buf/dma-buf-mapping.c
> > index 794acff2546a..f2bde38fdb1f 100644
> > --- a/drivers/dma-buf/dma-buf-mapping.c
> > +++ b/drivers/dma-buf/dma-buf-mapping.c
> > @@ -5,6 +5,9 @@
> >   */
> >  #include <linux/dma-buf-mapping.h>
> >  #include <linux/dma-resv.h>
> > +#include <linux/align.h>
> > +
> > +#define MAX_ENT_SZ ALIGN_DOWN(UINT_MAX, PAGE_SIZE)
> >
> >  static struct scatterlist *fill_sg_entry(struct scatterlist *sgl, size_t
> > length,
> >                                           dma_addr_t addr)
> > @@ -12,9 +15,9 @@ static struct scatterlist *fill_sg_entry(struct
> > scatterlist *sgl, size_t length,
> >          unsigned int len, nents;
> >          int i;
> >
> > -        nents = DIV_ROUND_UP(length, UINT_MAX);
> > +        nents = DIV_ROUND_UP(length, MAX_ENT_SZ);
> >          for (i = 0; i < nents; i++) {
>
> Why not change that to 'while (length) {' to avoid the division above.

Sounds good, will do.

> >
> > -                len = min_t(size_t, length, UINT_MAX);
> > +                len = min_t(size_t, length, MAX_ENT_SZ);
>
> I bet that doesn't need to be min_t()

Agreed.

>
> >                  length -= len;
> >                  /*
> >                   * DMABUF abuses scatterlist to create a scatterlist
> > @@ -24,7 +27,7 @@ static struct scatterlist *fill_sg_entry(struct
> > scatterlist *sgl, size_t length,
> >                   * does not require the CPU list for mapping or unmapping.
> >                   */
> >                  sg_set_page(sgl, NULL, 0, 0);
> > -                sg_dma_address(sgl) = addr + (dma_addr_t)i * UINT_MAX;
> > +                sg_dma_address(sgl) = addr + (dma_addr_t)i * MAX_ENT_SZ;
> >                  sg_dma_len(sgl) = len;
>
> Replace the multiply with 'addr += len'.

Will update this as well.

>
> -- David
>
> >                  sgl = sg_next(sgl);
> >          }
> > @@ -41,14 +44,14 @@ static unsigned int calc_sg_nents(struct dma_iova_state
> > *state,
> >
> >          if (!state || !dma_use_iova(state)) {
> >                  for (i = 0; i < nr_ranges; i++)
> > -                        nents += DIV_ROUND_UP(phys_vec[i].len, UINT_MAX);
> > +                        nents += DIV_ROUND_UP(phys_vec[i].len, MAX_ENT_SZ);
> >          } else {
> >                  /*
> >                   * In IOVA case, there is only one SG entry which spans
> >                   * for whole IOVA address space, but we need to make sure
> >                   * that it fits sg->length, maybe we need more.
> >                   */
> > -                nents = DIV_ROUND_UP(size, UINT_MAX);
> > +                nents = DIV_ROUND_UP(size, MAX_ENT_SZ);
> >          }
> >
> >          return nents;
>
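
For reference, the splitting approach discussed in this thread (2G
chunks, `while (length)`, `addr += len`) can be sketched in userspace
roughly as below. This is an illustration under stated assumptions, not
the v2 patch: `struct fake_sg`, `fill_entries()` and
`all_page_aligned()` are hypothetical stand-ins invented so the code
builds outside the kernel, `SZ_2G` is redefined locally with the same
value as in `linux/sizes.h`, and a 4K page size is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SZ_2G     0x80000000ULL /* same value as SZ_2G in linux/sizes.h */
#define PAGE_SIZE 4096ULL       /* assumption: 4K pages */

/* Hypothetical stand-in for the DMA fields of struct scatterlist. */
struct fake_sg {
	uint64_t dma_address;
	uint32_t dma_len;
};

/*
 * Split [addr, addr + length) into entries of at most max_sz bytes.
 * No division or multiplication: the loop runs until length is
 * consumed and the address is advanced by the length just emitted.
 * max_sz must be non-zero and fit in 32 bits.
 * Returns the number of entries written (at most cap).
 */
static size_t fill_entries(struct fake_sg *sg, size_t cap,
			   uint64_t addr, uint64_t length, uint64_t max_sz)
{
	size_t n = 0;

	assert(max_sz != 0 && max_sz <= UINT32_MAX);

	while (length && n < cap) {
		uint64_t len = length < max_sz ? length : max_sz;

		sg[n].dma_address = addr;
		sg[n].dma_len = (uint32_t)len;

		addr += len;
		length -= len;
		n++;
	}

	return n;
}

/* Return 1 if every entry's DMA address is page aligned, else 0. */
static int all_page_aligned(const struct fake_sg *sg, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (sg[i].dma_address % PAGE_SIZE)
			return 0;

	return 1;
}
```

Calling `fill_entries()` on a page-aligned 9G range with `max_sz` set to
`UINT32_MAX` reproduces the reported problem: the second entry starts
`0xFFFFFFFF` bytes after the first and is no longer page aligned. With
`max_sz` set to `SZ_2G`, every entry starts on a 2G multiple from the
base address, so alignment is preserved.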
