On Thu, Apr 17, 2025 at 1:06 AM Christian König <christian.koe...@amd.com> wrote:
>
> On 16.04.25 at 23:51, Juan Yescas wrote:
>>
>> On Wed, Apr 16, 2025 at 4:34 AM Christian König <christian.koe...@amd.com> wrote:
>>>
>>> On 15.04.25 at 19:19, Juan Yescas wrote:
>>>>
>>>> This change sets the allocation orders for the different page sizes
>>>> (4k, 16k, 64k) based on PAGE_SHIFT. Before this change, the orders
>>>> for large page sizes were calculated incorrectly, which caused the
>>>> system heap to allocate from 2% to 4% more memory on 16KiB page
>>>> size kernels.
>>>>
>>>> This change was tested on 4k/16k page size kernels.
>>>>
>>>> Signed-off-by: Juan Yescas <jyes...@google.com>
>>>> ---
>>>>  drivers/dma-buf/heaps/system_heap.c | 9 ++++++++-
>>>>  1 file changed, 8 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/dma-buf/heaps/system_heap.c b/drivers/dma-buf/heaps/system_heap.c
>>>> index 26d5dc89ea16..54674c02dcb4 100644
>>>> --- a/drivers/dma-buf/heaps/system_heap.c
>>>> +++ b/drivers/dma-buf/heaps/system_heap.c
>>>> @@ -50,8 +50,15 @@ static gfp_t order_flags[] = {HIGH_ORDER_GFP, HIGH_ORDER_GFP, LOW_ORDER_GFP};
>>>>   * to match with the sizes often found in IOMMUs. Using order 4 pages instead
>>>>   * of order 0 pages can significantly improve the performance of many IOMMUs
>>>>   * by reducing TLB pressure and time spent updating page tables.
>>>> + *
>>>> + * Note: When the order is 0, the minimum allocation is PAGE_SIZE. The possible
>>>> + * page sizes for ARM devices could be 4K, 16K and 64K.
>>>>   */
>>>> -static const unsigned int orders[] = {8, 4, 0};
>>>> +#define ORDER_1M (20 - PAGE_SHIFT)
>>>> +#define ORDER_64K (16 - PAGE_SHIFT)
>>>> +#define ORDER_FOR_PAGE_SIZE (0)
>>>> +static const unsigned int orders[] = {ORDER_1M, ORDER_64K, ORDER_FOR_PAGE_SIZE};
>>>> +#
>>>
>>> Good catch, but I think the defines are just overkill.
>>>
>>> What you should do instead is to subtract the page shift when using the array.
>>
>> There are several occurrences of orders in the file, and I think it is
>> better to do the calculations up front in the array. Would you be OK
>> if we got rid of the defines, as per your suggestion, and did the
>> calculations in the definition of the array? Something like this:
>>
>> static const unsigned int orders[] = {20 - PAGE_SHIFT, 16 - PAGE_SHIFT, 0};
>
> Yeah, that totally works for me as well. Just make sure that a code comment
> nearby explains why 20, 16 and 0.
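As a quick reference, the arithmetic behind the 20, 16 and 0 can be checked
with a minimal standalone userspace sketch (not part of the patch; it only
evaluates the proposed expressions for the common ARM page sizes):

    /* An order-n allocation covers PAGE_SIZE << n bytes, i.e. 1 << (PAGE_SHIFT + n). */
    #include <stdio.h>

    int main(void)
    {
        /* PAGE_SHIFT values for 4K, 16K and 64K base pages. */
        static const unsigned int page_shifts[] = { 12, 14, 16 };

        for (int i = 0; i < 3; i++) {
            unsigned int ps = page_shifts[i];
            unsigned int orders[] = { 20 - ps, 16 - ps, 0 };

            printf("PAGE_SHIFT=%u: orders {%u, %u, %u} -> %luK, %luK, %luK chunks\n",
                   ps, orders[0], orders[1], orders[2],
                   (1UL << (ps + orders[0])) >> 10,
                   (1UL << (ps + orders[1])) >> 10,
                   (1UL << (ps + orders[2])) >> 10);
        }
        return 0;
    }

Whatever the base page size, 20 - PAGE_SHIFT always yields a 1MiB chunk and
16 - PAGE_SHIFT a 64KiB chunk. The old hard-coded {8, 4, 0} was only correct
for 4K pages; on a 16K kernel it asked for 4MiB and 256KiB chunks instead.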
Thanks, the changes were added in the [PATCH v3].

>>> Apart from that, using 1M, 64K and then falling back to 4K just sounds random
>>> to me. We have especially pushed back on 64K more than once because it is
>>> actually not beneficial in almost all cases.
>>
>> In the hardware where the driver is used, the 64K is beneficial for:
>>
>> Arm SMMUv3 (4KB granule size):
>>   64KB (contiguous bit, grouping 16 4KB pages)
>>
>> The SysMMU also benefits from the 64KB "large" page on 4k/16k page sizes.
>
> Question: could this HW also work with pages larger than 64K? (I strongly
> expect yes.)

Yes, if the page sizes in the SMMUv3 are:

- 4k, we can have 2MiB pages
- 16k, we can have 32MiB pages

>>> I suggest fixing the code in system_heap_allocate to not over-allocate,
>>> and instead just try the available orders like TTM does. This has proven
>>> to work independent of the architecture.
>>
>> Do you mean an implementation similar to __ttm_pool_alloc()?
>
> Yeah, something like that.
>
> https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/ttm/ttm_pool.c?h=v6.15-rc2#n728
>
>> If that is the case, we can try the change, run some benchmarks and
>> submit the patch if we see positive results.
>
> Could be that this doesn't matter for your platform, but it's minimal extra
> overhead to ask for larger chunks first, and it just avoids fragmenting
> everything into 64K chunks.
>
> That is kind of important since the code in DMA-heaps should be platform
> neutral if possible.

I agree, I'll make the change in another patch.

Thanks
Juan

> Regards,
> Christian.
>
>> Thanks
>> Juan
>>
>>> Regards,
>>> Christian.
>>>
>>>>  #define NUM_ORDERS ARRAY_SIZE(orders)
>>>>
>>>>  static struct sg_table *dup_sg_table(struct sg_table *table)
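To make the TTM suggestion concrete, here is a rough sketch of the loop shape
used by __ttm_pool_alloc(), adapted to the system heap's names. This is
hypothetical code, not the actual follow-up patch: num_pages, the free_pages
label and the sg_table bookkeeping are assumptions for illustration.

    /*
     * Try the largest order that still fits the remaining request and step
     * down one order at a time on failure, so the buffer is filled with the
     * biggest chunks the page allocator can provide right now.
     */
    unsigned int order = MAX_PAGE_ORDER;

    while (num_pages) {
        struct page *page;

        /* Never ask for more pages than the remaining request needs. */
        order = min_t(unsigned int, order, __fls(num_pages));

        page = alloc_pages(order ? HIGH_ORDER_GFP : LOW_ORDER_GFP, order);
        if (!page) {
            if (!order)
                goto free_pages;    /* even order 0 failed, bail out */
            order--;                /* retry with a smaller chunk */
            continue;
        }

        /* Append the (1 << order) pages behind 'page' to the sg_table here. */
        num_pages -= 1UL << order;
    }

Compared with the fixed {1M, 64K, PAGE_SIZE} table, this asks for larger
chunks first at minimal extra cost and only fragments down as far as the
allocator forces it to, which is the platform-neutral behavior requested
above.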