[PATCH v1 2/2] mm: enforce pageblock_order < MAX_ORDER
Some places in the kernel don't really expect pageblock_order >= MAX_ORDER, and it looks like this is only possible in corner cases: 1) CONFIG_DEFERRED_STRUCT_PAGE_INIT we'll end up freeing pageblock_order pages via __free_pages_core(), which cannot possibly work. 2) find_zone_movable_pfns_for_nodes() will roundup the ZONE_MOVABLE start PFN to MAX_ORDER_NR_PAGES. Consequently with a bigger pageblock_order, we could have a single pageblock partially managed by two zones. 3) compaction code runs into __fragmentation_index() with order >= MAX_ORDER, when checking WARN_ON_ONCE(order >= MAX_ORDER). [1] 4) mm/page_reporting.c won't be reporting any pages with default page_reporting_order == pageblock_order, as we'll be skipping the reporting loop inside page_reporting_process_zone(). 5) __rmqueue_fallback() will never be able to steal with ALLOC_NOFRAGMENT. pageblock_order >= MAX_ORDER is weird either way: it's a pure optimization for making alloc_contig_range(), as used for allocation of gigantic pages, a little more reliable to succeed. However, if there is demand for somewhat reliable allocation of gigantic pages, affected setups should be using CMA or boottime allocations instead. So let's make sure that pageblock_order < MAX_ORDER and simplify. [1] https://lkml.kernel.org/r/87r189a2ks@linux.ibm.com Signed-off-by: David Hildenbrand --- drivers/virtio/virtio_mem.c | 9 +++-- include/linux/cma.h | 3 +-- include/linux/pageblock-flags.h | 7 +-- mm/Kconfig | 3 +++ mm/page_alloc.c | 32 5 files changed, 20 insertions(+), 34 deletions(-) diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c index 38becd8d578c..e7d6b679596d 100644 --- a/drivers/virtio/virtio_mem.c +++ b/drivers/virtio/virtio_mem.c @@ -2476,13 +2476,10 @@ static int virtio_mem_init_hotplug(struct virtio_mem *vm) VIRTIO_MEM_DEFAULT_OFFLINE_THRESHOLD); /* -* We want subblocks to span at least MAX_ORDER_NR_PAGES and -* pageblock_nr_pages pages. 
This: -* - Is required for now for alloc_contig_range() to work reliably - -* it doesn't properly handle smaller granularity on ZONE_NORMAL. +* TODO: once alloc_contig_range() works reliably with pageblock +* granularity on ZONE_NORMAL, use pageblock_nr_pages instead. */ - sb_size = max_t(uint64_t, MAX_ORDER_NR_PAGES, - pageblock_nr_pages) * PAGE_SIZE; + sb_size = PAGE_SIZE * MAX_ORDER_NR_PAGES; sb_size = max_t(uint64_t, vm->device_block_size, sb_size); if (sb_size < memory_block_size_bytes() && !force_bbm) { diff --git a/include/linux/cma.h b/include/linux/cma.h index 75fe188ec4a1..b1ba94f1cc9c 100644 --- a/include/linux/cma.h +++ b/include/linux/cma.h @@ -25,8 +25,7 @@ * -- can deal with only some pageblocks of a higher-order page being * MIGRATE_CMA, we can use pageblock_nr_pages. */ -#define CMA_MIN_ALIGNMENT_PAGES max_t(phys_addr_t, MAX_ORDER_NR_PAGES, \ - pageblock_nr_pages) +#define CMA_MIN_ALIGNMENT_PAGES MAX_ORDER_NR_PAGES #define CMA_MIN_ALIGNMENT_BYTES (PAGE_SIZE * CMA_MIN_ALIGNMENT_PAGES) struct cma; diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h index 973fd731a520..83c7248053a1 100644 --- a/include/linux/pageblock-flags.h +++ b/include/linux/pageblock-flags.h @@ -37,8 +37,11 @@ extern unsigned int pageblock_order; #else /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */ -/* Huge pages are a constant size */ -#define pageblock_order HUGETLB_PAGE_ORDER +/* + * Huge pages are a constant size, but don't exceed the maximum allocation + * granularity. + */ +#define pageblock_order min_t(unsigned int, HUGETLB_PAGE_ORDER, MAX_ORDER - 1) #endif /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */ diff --git a/mm/Kconfig b/mm/Kconfig index 3326ee3903f3..4c91b92e7537 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -262,6 +262,9 @@ config HUGETLB_PAGE_SIZE_VARIABLE HUGETLB_PAGE_ORDER when there are multiple HugeTLB page sizes available on a platform. + Note that the pageblock_order cannot exceed MAX_ORDER - 1 and will be + clamped down to MAX_ORDER - 1. 
+ config CONTIG_ALLOC def_bool (MEMORY_ISOLATION && COMPACTION) || CMA diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 3589febc6d31..04cf964b57b5 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1072,14 +1072,12 @@ static inline void __free_one_page(struct page *page, int migratetype, fpi_t fpi_flags) { struct capture_control *capc = task_capc(zone); + unsigned int max_order = pageblock_order; unsigned long buddy_pfn; unsigned long combined_pfn; - unsigned int max_order; struct page *buddy; bool
[PATCH v1 1/2] cma: factor out minimum alignment requirement
Let's factor out determining the minimum alignment requirement for CMA and add a helpful comment. No functional change intended. Signed-off-by: David Hildenbrand --- arch/powerpc/include/asm/fadump-internal.h | 5 - arch/powerpc/kernel/fadump.c | 2 +- drivers/of/of_reserved_mem.c | 9 +++-- include/linux/cma.h | 9 + kernel/dma/contiguous.c | 4 +--- mm/cma.c | 20 +--- 6 files changed, 19 insertions(+), 30 deletions(-) diff --git a/arch/powerpc/include/asm/fadump-internal.h b/arch/powerpc/include/asm/fadump-internal.h index 52189928ec08..81bcb9abb371 100644 --- a/arch/powerpc/include/asm/fadump-internal.h +++ b/arch/powerpc/include/asm/fadump-internal.h @@ -19,11 +19,6 @@ #define memblock_num_regions(memblock_type) (memblock.memblock_type.cnt) -/* Alignment per CMA requirement. */ -#define FADUMP_CMA_ALIGNMENT (PAGE_SIZE << \ -max_t(unsigned long, MAX_ORDER - 1,\ -pageblock_order)) - /* FAD commands */ #define FADUMP_REGISTER 1 #define FADUMP_UNREGISTER 2 diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c index d03e488cfe9c..7eb67201ea41 100644 --- a/arch/powerpc/kernel/fadump.c +++ b/arch/powerpc/kernel/fadump.c @@ -544,7 +544,7 @@ int __init fadump_reserve_mem(void) if (!fw_dump.nocma) { fw_dump.boot_memory_size = ALIGN(fw_dump.boot_memory_size, - FADUMP_CMA_ALIGNMENT); + CMA_MIN_ALIGNMENT_BYTES); } #endif diff --git a/drivers/of/of_reserved_mem.c b/drivers/of/of_reserved_mem.c index 9c0fb962c22b..75caa6f5d36f 100644 --- a/drivers/of/of_reserved_mem.c +++ b/drivers/of/of_reserved_mem.c @@ -22,6 +22,7 @@ #include #include #include +#include #include "of_private.h" @@ -116,12 +117,8 @@ static int __init __reserved_mem_alloc_size(unsigned long node, if (IS_ENABLED(CONFIG_CMA) && of_flat_dt_is_compatible(node, "shared-dma-pool") && of_get_flat_dt_prop(node, "reusable", NULL) - && !nomap) { - unsigned long order = - max_t(unsigned long, MAX_ORDER - 1, pageblock_order); - - align = max(align, (phys_addr_t)PAGE_SIZE << order); - } + && !nomap) + align = 
max_t(phys_addr_t, align, CMA_MIN_ALIGNMENT_BYTES); prop = of_get_flat_dt_prop(node, "alloc-ranges", ); if (prop) { diff --git a/include/linux/cma.h b/include/linux/cma.h index bd801023504b..75fe188ec4a1 100644 --- a/include/linux/cma.h +++ b/include/linux/cma.h @@ -20,6 +20,15 @@ #define CMA_MAX_NAME 64 +/* + * TODO: once the buddy -- especially pageblock merging and alloc_contig_range() + * -- can deal with only some pageblocks of a higher-order page being + * MIGRATE_CMA, we can use pageblock_nr_pages. + */ +#define CMA_MIN_ALIGNMENT_PAGES max_t(phys_addr_t, MAX_ORDER_NR_PAGES, \ + pageblock_nr_pages) +#define CMA_MIN_ALIGNMENT_BYTES (PAGE_SIZE * CMA_MIN_ALIGNMENT_PAGES) + struct cma; extern unsigned long totalcma_pages; diff --git a/kernel/dma/contiguous.c b/kernel/dma/contiguous.c index 3d63d91cba5c..6ea80ae42622 100644 --- a/kernel/dma/contiguous.c +++ b/kernel/dma/contiguous.c @@ -399,8 +399,6 @@ static const struct reserved_mem_ops rmem_cma_ops = { static int __init rmem_cma_setup(struct reserved_mem *rmem) { - phys_addr_t align = PAGE_SIZE << max(MAX_ORDER - 1, pageblock_order); - phys_addr_t mask = align - 1; unsigned long node = rmem->fdt_node; bool default_cma = of_get_flat_dt_prop(node, "linux,cma-default", NULL); struct cma *cma; @@ -416,7 +414,7 @@ static int __init rmem_cma_setup(struct reserved_mem *rmem) of_get_flat_dt_prop(node, "no-map", NULL)) return -EINVAL; - if ((rmem->base & mask) || (rmem->size & mask)) { + if (!IS_ALIGNED(rmem->base | rmem->size, CMA_MIN_ALIGNMENT_BYTES)) { pr_err("Reserved memory: incorrect alignment of CMA region\n"); return -EINVAL; } diff --git a/mm/cma.c b/mm/cma.c index bc9ca8f3c487..5a2cd5851658 100644 --- a/mm/cma.c +++ b/mm/cma.c @@ -168,7 +168,6 @@ int __init cma_init_reserved_mem(phys_addr_t base, phys_addr_t size, struct cma **res_cma) { struct cma *cma; - phys_addr_t alignment; /* Sanity checks */ if (cma_area_count == ARRAY_SIZE(cma_areas)) { @@ -179,15 +178,12 @@ int __init cma_init_reserved_mem(phys_addr_t 
base
[PATCH v1 0/2] mm: enforce pageblock_order < MAX_ORDER
Having pageblock_order >= MAX_ORDER seems to be able to happen in corner cases and some parts of the kernel are not prepared for it. For example, Aneesh has shown [1] that such kernels can be compiled on ppc64 with 64k base pages by setting FORCE_MAX_ZONEORDER=8, which will run into a WARN_ON_ONCE(order >= MAX_ORDER) in compaction code right during boot. We can get pageblock_order >= MAX_ORDER when the default hugetlb size is bigger than the maximum allocation granularity of the buddy, in which case we are no longer talking about huge pages but instead gigantic pages. Having pageblock_order >= MAX_ORDER can only make alloc_contig_range() of such gigantic pages more likely to succeed. Reliable use of gigantic pages either requires boot time allocation or CMA, no need to overcomplicate some places in the kernel to optimize for corner cases that are broken in other areas of the kernel. Let's enforce pageblock_order < MAX_ORDER and simplify. Especially patch #1 can be regarded as a cleanup before: [PATCH v5 0/6] Use pageblock_order for cma and alloc_contig_range alignment. [2] [1] https://lkml.kernel.org/r/87r189a2ks@linux.ibm.com [2] https://lkml.kernel.org/r/20220211164135.1803616-1-zi@sent.com Cc: Andrew Morton Cc: Aneesh Kumar K V Cc: Zi Yan Cc: Michael Ellerman Cc: Benjamin Herrenschmidt Cc: Paul Mackerras Cc: Rob Herring Cc: Frank Rowand Cc: "Michael S. 
Tsirkin" Cc: Christoph Hellwig Cc: Marek Szyprowski Cc: Robin Murphy Cc: Minchan Kim Cc: Vlastimil Babka Cc: linuxppc-...@lists.ozlabs.org Cc: devicet...@vger.kernel.org Cc: virtualizat...@lists.linux-foundation.org Cc: iommu@lists.linux-foundation.org Cc: linux...@kvack.org David Hildenbrand (2): cma: factor out minimum alignment requirement mm: enforce pageblock_order < MAX_ORDER arch/powerpc/include/asm/fadump-internal.h | 5 arch/powerpc/kernel/fadump.c | 2 +- drivers/of/of_reserved_mem.c | 9 ++ drivers/virtio/virtio_mem.c| 9 ++ include/linux/cma.h| 8 ++ include/linux/pageblock-flags.h| 7 +++-- kernel/dma/contiguous.c| 4 +-- mm/Kconfig | 3 ++ mm/cma.c | 20 -- mm/page_alloc.c| 32 ++ 10 files changed, 37 insertions(+), 62 deletions(-) base-commit: 754e0b0e35608ed5206d6a67a791563c631cec07 -- 2.34.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v4 3/7] mm: page_isolation: check specified range for unmovable pages
On 02.02.22 13:18, Oscar Salvador wrote: > On Wed, Jan 19, 2022 at 02:06:19PM -0500, Zi Yan wrote: >> From: Zi Yan >> >> Enable set_migratetype_isolate() to check specified sub-range for >> unmovable pages during isolation. Page isolation is done >> at max(MAX_ORDER_NR_PAEGS, pageblock_nr_pages) granularity, but not all >> pages within that granularity are intended to be isolated. For example, >> alloc_contig_range(), which uses page isolation, allows ranges without >> alignment. This commit makes unmovable page check only look for >> interesting pages, so that page isolation can succeed for any >> non-overlapping ranges. > > Another thing that came to my mind. > Prior to this patch, has_unmovable_pages() was checking on pageblock > granularity, starting at pfn#0 of the pageblock. > With this patch, you no longer check on pageblock granularity, which > means you might isolate a pageblock, but some pages that sneaked in > might actually be unmovable. > > E.g: > > Let's say you have a pageblock that spans (pfn#512,pfn#1024), > and you pass alloc_contig_range() (pfn#514,pfn#1024). > has_unmovable_pages() will start checking the pageblock at pfn#514, > and so it will mis pfn#512 and pfn#513. Isn't that a problem, if those > pfn turn out to be actually unmovable? That's the whole idea for being able to allocate parts of an unmovable pageblock that are movable. If the first part is unmovable but the second part is movable, nothing should stop us from trying to allocate the second part. Of course, we want to remember the original migratetype of the pageblock, to restore that after isolation -- otherwise we'll end up converting all such pageblocks to MIGRATE_MOVABLE. The next patch does that IIRC. However, devil is in the detail, and I still have to review those parts of this series. Note that there are no current users of alloc_contig_range() that allocate < MAX_ORDER - 1 -- except CMA, but for CMA we immediately exit has_unmovable_pages() either way. 
-- Thanks, David / dhildenb
Re: [RFC PATCH v3 7/8] drivers: virtio_mem: use pageblock size as the minimum virtio_mem size.
On 05.01.22 22:47, Zi Yan wrote: > From: Zi Yan > > alloc_contig_range() now only needs to be aligned to pageblock_order, > drop virtio_mem size requirement that it needs to be the max of > pageblock_order and MAX_ORDER. > > Signed-off-by: Zi Yan > --- > drivers/virtio/virtio_mem.c | 3 +-- > 1 file changed, 1 insertion(+), 2 deletions(-) > > diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c > index a6a78685cfbe..2664dc16d0f9 100644 > --- a/drivers/virtio/virtio_mem.c > +++ b/drivers/virtio/virtio_mem.c > @@ -2481,8 +2481,7 @@ static int virtio_mem_init_hotplug(struct virtio_mem > *vm) >* - Is required for now for alloc_contig_range() to work reliably - >* it doesn't properly handle smaller granularity on ZONE_NORMAL. >*/ Please also update this comment. -- Thanks, David / dhildenb
Re: [RFC PATCH v3 5/8] mm: page_isolation: check specified range for unmovable pages during isolation.
On 05.01.22 22:47, Zi Yan wrote: > From: Zi Yan > > Enable set_migratetype_isolate() to check specified sub-range for > unmovable pages during isolation. Page isolation is done > at max(MAX_ORDER_NR_PAEGS, pageblock_nr_pages) granularity, but not all > pages within that granularity are intended to be isolated. For example, > alloc_contig_range(), which uses page isolation, allows ranges without > alignment. This commit makes unmovable page check only look for > interesting pages, so that page isolation can succeed for any > non-overlapping ranges. Are you handling if we start checking in the middle of a compound page and actually have to lookup the head to figure out if movable or not? > > has_unmovable_pages() is moved to mm/page_isolation.c since it is only > used by page isolation. Please move that into a separate patch upfront, makes this patch much easier to review. -- Thanks, David / dhildenb
Re: [RFC PATCH v3 3/8] mm: migrate: allocate the right size of non hugetlb or THP compound pages.
On 13.01.22 16:46, Zi Yan wrote: > On 12 Jan 2022, at 6:04, David Hildenbrand wrote: > >> On 05.01.22 22:47, Zi Yan wrote: >>> From: Zi Yan >>> >>> alloc_migration_target() is used by alloc_contig_range() and non-LRU >>> movable compound pages can be migrated. Current code does not allocate the >>> right page size for such pages. Check THP precisely using >>> is_transparent_huge() and add allocation support for non-LRU compound >>> pages. >> >> IIRC, we don't have any non-lru migratable pages that are coumpound >> pages. Read: not used and not supported :) > > OK, but nothing prevents one writing a driver that allocates compound > pages and provides address_space->migratepage() and > address_space->isolate_page(). > > Actually, to test this series, I write a kernel module that allocates > an order-10 page, gives it a fake address_space with migratepage() and > isolate_page(), __SetPageMovable() on it, then call alloc_contig_range() > on the page range. Apparently, my kernel module is not supported by > the kernel, thus, I added this patch. > > Do you have an alternative test to my kernel module, so that I do not > even need this patch myself? > >> Why is this required in the context of this series? > > It might not be required. I will drop it. That's why I think it would be best dropping it. If you need it in different context, better submit it in different context. Makes this series easier to digest :) -- Thanks, David / dhildenb
Re: [RFC PATCH v3 1/8] mm: page_alloc: avoid merging non-fallbackable pageblocks with others.
On 13.01.22 12:36, Mike Rapoport wrote: > On Wed, Jan 12, 2022 at 11:54:49AM +0100, David Hildenbrand wrote: >> On 05.01.22 22:47, Zi Yan wrote: >>> From: Zi Yan >>> >>> This is done in addition to MIGRATE_ISOLATE pageblock merge avoidance. >>> It prepares for the upcoming removal of the MAX_ORDER-1 alignment >>> requirement for CMA and alloc_contig_range(). >>> >>> MIGRARTE_HIGHATOMIC should not merge with other migratetypes like >>> MIGRATE_ISOLATE and MIGRARTE_CMA[1], so this commit prevents that too. >>> Also add MIGRARTE_HIGHATOMIC to fallbacks array for completeness. >>> >>> [1] >>> https://lore.kernel.org/linux-mm/20211130100853.gp3...@techsingularity.net/ >>> >>> Signed-off-by: Zi Yan >>> --- >>> include/linux/mmzone.h | 6 ++ >>> mm/page_alloc.c| 28 ++-- >>> 2 files changed, 24 insertions(+), 10 deletions(-) >>> > > ... > >>> @@ -3545,8 +3553,8 @@ int __isolate_free_page(struct page *page, unsigned >>> int order) >>> struct page *endpage = page + (1 << order) - 1; >>> for (; page < endpage; page += pageblock_nr_pages) { >>> int mt = get_pageblock_migratetype(page); >>> - if (!is_migrate_isolate(mt) && !is_migrate_cma(mt) >>> - && !is_migrate_highatomic(mt)) >>> + /* Only change normal pageblock */ >>> + if (migratetype_has_fallback(mt)) >>> set_pageblock_migratetype(page, >>> MIGRATE_MOVABLE); >>> } >> >> That part is a nice cleanup IMHO. Although the "has fallback" part is a >> bit imprecise. "migratetype_is_mergable()" might be a bit clearer. >> ideally "migratetype_is_mergable_with_other_types()". Can we come up >> with a nice name for that? > > migratetype_is_mergable() kinda implies "_with_other_types", no? > > I like migratetype_is_mergable() more than _has_fallback(). > > My $0.02 to bikeshedding :) :) Yeah, for me migratetype_is_mergable() would also be good enough. I think I was at first thinking one could mistake it with a dedicated migratetype. 
But such functions are historically called is_migrate_cma/is_migrate_cma/ -- Thanks, David / dhildenb
Re: [RFC PATCH v3 3/8] mm: migrate: allocate the right size of non hugetlb or THP compound pages.
On 05.01.22 22:47, Zi Yan wrote: > From: Zi Yan > > alloc_migration_target() is used by alloc_contig_range() and non-LRU > movable compound pages can be migrated. Current code does not allocate the > right page size for such pages. Check THP precisely using > is_transparent_huge() and add allocation support for non-LRU compound > pages. IIRC, we don't have any non-lru migratable pages that are compound pages. Read: not used and not supported :) Why is this required in the context of this series? -- Thanks, David / dhildenb
Re: [RFC PATCH v3 2/8] mm: compaction: handle non-lru compound pages properly in isolate_migratepages_block().
On 05.01.22 22:47, Zi Yan wrote: > From: Zi Yan > > In isolate_migratepages_block(), a !PageLRU tail page can be encountered > when the page is larger than a pageblock. Use compound head page for the > checks inside and skip the entire compound page when isolation succeeds. > This will currently never happen, due to the way we always isolate MAX_ORDER - 1 ranges, correct? Better note that in the patch description, because currently it reads like it's an actual fix "can be encountered". > Signed-off-by: Zi Yan > --- > mm/compaction.c | 10 +++--- > 1 file changed, 7 insertions(+), 3 deletions(-) > > diff --git a/mm/compaction.c b/mm/compaction.c > index b4e94cda3019..ad9053fbbe06 100644 > --- a/mm/compaction.c > +++ b/mm/compaction.c > @@ -979,19 +979,23 @@ isolate_migratepages_block(struct compact_control *cc, > unsigned long low_pfn, >* Skip any other type of page >*/ > if (!PageLRU(page)) { > + struct page *head = compound_head(page); > /* >* __PageMovable can return false positive so we need >* to verify it under page_lock. >*/ > - if (unlikely(__PageMovable(page)) && > - !PageIsolated(page)) { > + if (unlikely(__PageMovable(head)) && > + !PageIsolated(head)) { > if (locked) { > unlock_page_lruvec_irqrestore(locked, > flags); > locked = NULL; > } > > - if (!isolate_movable_page(page, isolate_mode)) > + if (!isolate_movable_page(head, isolate_mode)) { > + low_pfn += (1 << compound_order(head)) > - 1 - (page - head); > + page = head; > goto isolate_success; > + } > } > > goto isolate_fail; -- Thanks, David / dhildenb
Re: [RFC PATCH v3 1/8] mm: page_alloc: avoid merging non-fallbackable pageblocks with others.
On 05.01.22 22:47, Zi Yan wrote: > From: Zi Yan > > This is done in addition to MIGRATE_ISOLATE pageblock merge avoidance. > It prepares for the upcoming removal of the MAX_ORDER-1 alignment > requirement for CMA and alloc_contig_range(). > > MIGRARTE_HIGHATOMIC should not merge with other migratetypes like > MIGRATE_ISOLATE and MIGRARTE_CMA[1], so this commit prevents that too. > Also add MIGRARTE_HIGHATOMIC to fallbacks array for completeness. > > [1] > https://lore.kernel.org/linux-mm/20211130100853.gp3...@techsingularity.net/ > > Signed-off-by: Zi Yan > --- > include/linux/mmzone.h | 6 ++ > mm/page_alloc.c| 28 ++-- > 2 files changed, 24 insertions(+), 10 deletions(-) > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > index aed44e9b5d89..0aa549653e4e 100644 > --- a/include/linux/mmzone.h > +++ b/include/linux/mmzone.h > @@ -83,6 +83,12 @@ static inline bool is_migrate_movable(int mt) > return is_migrate_cma(mt) || mt == MIGRATE_MOVABLE; > } > > +/* See fallbacks[MIGRATE_TYPES][3] in page_alloc.c */ > +static inline bool migratetype_has_fallback(int mt) > +{ > + return mt < MIGRATE_PCPTYPES; > +} > + > #define for_each_migratetype_order(order, type) \ > for (order = 0; order < MAX_ORDER; order++) \ > for (type = 0; type < MIGRATE_TYPES; type++) > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 8dd6399bafb5..5193c953dbf8 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -1042,6 +1042,12 @@ buddy_merge_likely(unsigned long pfn, unsigned long > buddy_pfn, > return page_is_buddy(higher_page, higher_buddy, order + 1); > } > > +static inline bool has_non_fallback_pageblock(struct zone *zone) > +{ > + return has_isolate_pageblock(zone) || zone_cma_pages(zone) != 0 || > + zone->nr_reserved_highatomic != 0; > +} Due to zone_cma_pages(), the unlikely() below will be very wrong on many setups. Previously, isolation really was a corner case. CMA and highatomic are less of a corner case ... 
I'm not even sure if this check is worth having around anymore at all, or if it would be easier and cheaper to just always check both migration types unconditionally. Would certainly simplify the code. Side note: we actually care about has_free_non_fallback_pageblock(), we can only merge with free pageblocks. But that might not necessarily be cheaper to test/track/check. > + > /* > * Freeing function for a buddy system allocator. > * > @@ -1117,14 +1123,15 @@ static inline void __free_one_page(struct page *page, > } > if (order < MAX_ORDER - 1) { > /* If we are here, it means order is >= pageblock_order. > - * We want to prevent merge between freepages on isolate > - * pageblock and normal pageblock. Without this, pageblock > - * isolation could cause incorrect freepage or CMA accounting. > + * We want to prevent merge between freepages on pageblock > + * without fallbacks and normal pageblock. Without this, > + * pageblock isolation could cause incorrect freepage or CMA > + * accounting or HIGHATOMIC accounting. >* >* We don't want to hit this code for the more frequent >* low-order merging. 
>*/ > - if (unlikely(has_isolate_pageblock(zone))) { > + if (unlikely(has_non_fallback_pageblock(zone))) { > int buddy_mt; > > buddy_pfn = __find_buddy_pfn(pfn, order); > @@ -1132,8 +1139,8 @@ static inline void __free_one_page(struct page *page, > buddy_mt = get_pageblock_migratetype(buddy); > > if (migratetype != buddy_mt > - && (is_migrate_isolate(migratetype) || > - is_migrate_isolate(buddy_mt))) > + && > (!migratetype_has_fallback(migratetype) || > + > !migratetype_has_fallback(buddy_mt))) > goto done_merging; > } > max_order = order + 1; > @@ -2484,6 +2491,7 @@ static int fallbacks[MIGRATE_TYPES][3] = { > [MIGRATE_UNMOVABLE] = { MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE, > MIGRATE_TYPES }, > [MIGRATE_MOVABLE] = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE, > MIGRATE_TYPES }, > [MIGRATE_RECLAIMABLE] = { MIGRATE_UNMOVABLE, MIGRATE_MOVABLE, > MIGRATE_TYPES }, > + [MIGRATE_HIGHATOMIC] = { MIGRATE_TYPES }, /* Never used */ > #ifdef CONFIG_CMA > [MIGRATE_CMA] = { MIGRATE_TYPES }, /* Never used */ > #endif > @@ -2795,8 +2803,8 @@ static void reserve_highatomic_pageblock(struct page > *page, struct zone *zone, > > /* Yoink! */ > mt = get_pageblock_migratetype(page); > - if (!is_migrate_highatomic(mt) &&
Re: [RFC PATCH v2 0/7] Use pageblock_order for cma and alloc_contig_range alignment.
On 10.12.21 00:04, Zi Yan wrote: > From: Zi Yan > > Hi all, Hi, thanks for working on that! > > This patchset tries to remove the MAX_ORDER - 1 alignment requirement for CMA > and alloc_contig_range(). It prepares for my upcoming changes to make > MAX_ORDER > adjustable at boot time[1]. > > The MAX_ORDER - 1 alignment requirement comes from that alloc_contig_range() > isolates pageblocks to remove free memory from buddy allocator but isolating > only a subset of pageblocks within a page spanning across multiple pageblocks > causes free page accounting issues. Isolated page might not be put into the > right free list, since the code assumes the migratetype of the first pageblock > as the whole free page migratetype. This is based on the discussion at [2]. > > To remove the requirement, this patchset: > 1. still isolates pageblocks at MAX_ORDER - 1 granularity; > 2. but saves the pageblock migratetypes outside the specified range of >alloc_contig_range() and restores them after all pages within the range >become free after __alloc_contig_migrate_range(); > 3. splits free pages spanning multiple pageblocks at the beginning and the end >of the range and puts the split pages to the right migratetype free lists >based on the pageblock migratetypes; > 4. returns pages not in the range as it did before this patch. > > Isolation needs to happen at MAX_ORDER - 1 granularity, because otherwise > 1) extra code is needed to detect pages (free, PageHuge, THP, or PageCompound) > to make sure all pageblocks belonging to a single page are isolated together > and later pageblocks outside the range need to have their migratetypes > restored; > or 2) extra logic will need to be added during page free time to split a free > page with multi-migratetype pageblocks. > > Two optimizations might come later: > 1. only check unmovable pages within the range instead of MAX_ORDER - 1 > aligned >range during isolation to increase successful rate of alloc_contig_range(). 
The issue with virtio-mem is that we'll need that as soon as we change the granularity to pageblocks, because otherwise, you can heavily degrade unplug reliability in sane setups: Previous: * Try unplug free 4M range (2 pageblocks): succeeds Now: * Try unplug 2M range (first pageblock): succeeds. * Try unplug next 2M range (second pageblock): fails because first contains unmovable allocations. -- Thanks, David / dhildenb
Re: [RFC PATCH 0/3] Use pageblock_order for cma and alloc_contig_range alignment.
On 17.11.21 04:04, Zi Yan wrote: > On 16 Nov 2021, at 3:58, David Hildenbrand wrote: > >> On 15.11.21 20:37, Zi Yan wrote: >>> From: Zi Yan >>> >>> Hi David, >> >> Hi, >> >> thanks for looking into this. >> Hi, sorry for the delay, I wasn't "actually working" last week, so now I'm back from holiday :) >>> >>> You suggested to make alloc_contig_range() deal with pageblock_order >>> instead of >>> MAX_ORDER - 1 and get rid of MAX_ORDER - 1 dependency in virtio_mem[1]. This >>> patchset is my attempt to achieve that. Please take a look and let me know >>> if >>> I am doing it correctly or not. >>> >>> From what my understanding, cma required alignment of >>> max(MAX_ORDER - 1, pageblock_order), because when MIGRATE_CMA was >>> introduced, >>> __free_one_page() does not prevent merging two different pageblocks, when >>> MAX_ORDER - 1 > pageblock_order. But current __free_one_page() >>> implementation >>> does prevent that. It should be OK to just align cma to pageblock_order. >>> alloc_contig_range() relies on MIGRATE_CMA to get free pages, so it can use >>> pageblock_order as alignment too. >> >> I wonder if that's sufficient. Especially the outer_start logic in >> alloc_contig_range() might be problematic. There are some ugly corner >> cases with free pages/allocations spanning multiple pageblocks and we >> only isolated a single pageblock. > > Thank you a lot for writing the list of these corner cases. They are > very helpful! > >> >> >> Regarding CMA, we have to keep the following cases working: >> >> a) Different pageblock types (MIGRATE_CMA and !MIGRATE_CMA) in MAX_ORDER >>- 1 page: >>[ MAX_ ORDER - 1 ] >>[ pageblock 0 | pageblock 1] >> >> Assume either pageblock 0 is MIGRATE_CMA or pageblock 1 is MIGRATE_CMA, >> but not both. We have to make sure alloc_contig_range() keeps working >> correctly. This should be the case even with your change, as we won't >> merging pages accross differing migratetypes. > > Yes. 
> >> >> b) Migrating/freeing a MAX_ ORDER - 1 page while partially isolated: >>[ MAX_ ORDER - 1 ] >>[ pageblock 0 | pageblock 1] >> >> Assume both are MIGRATE_CMA. Assume we want to either allocate from >> pageblock 0 or pageblock 1. Especially, assume we want to allocate from >> pageblock 1. While we would isolate pageblock 1, we wouldn't isolate >> pageblock 0. >> >> What happens if we either have a free page spanning the MAX_ORDER - 1 >> range already OR if we have to migrate a MAX_ORDER - 1 page, resulting >> in a free MAX_ORDER - 1 page of which only the second pageblock is >> isolated? We would end up essentially freeing a page that has mixed >> pageblocks, essentially placing it in !MIGRATE_ISOLATE free lists ... I >> might be wrong but I have the feeling that this would be problematic. >> > > This could happen when start_isolate_page_range() stumbles upon a compound > page with order >= pageblock_order or a free page with order >= > pageblock_order, but should not. start_isolate_page_range() should check > the actual page size, either compound page size or free page size, and set > the migratetype across pageblocks if the page is bigger than pageblock size. > More precisely set_migratetype_isolate() should do that. Right -- but then we have to extend the isolation range from within set_migratetype_isolate() :/ E.g., how should we know what we have to unisolate later .. > > >> c) Concurrent allocations: >> [ MAX_ ORDER - 1 ] >> [ pageblock 0 | pageblock 1] >> >> Assume b) but we have two concurrent CMA allocations to pageblock 0 and >> pageblock 1, which would now be possible as start_isolate_page_range() >> isolate would succeed on both. > > Two isolations will be serialized by the zone lock taken by > set_migratetype_isolate(), so the concurrent allocation would not be a > problem. > If it is a MAX_ORDER-1 free page, the first comer should split it and only > isolate one of the pageblock then second one can isolate the other pageblock. 
Right, the issue is essentially b) above. > If it is a MAX_ORDER - 1 compound page, the first comer should isolate both > pageblocks, then the second one would fail. WDYT? Possibly we could even have two independent CMA areas "colliding" within a MAX_ORDER - 1 page. I guess the surprise would be getting an "-EAGAIN" from isolation, but the caller should properly handle that. Maybe it's really easier to
Re: [RFC PATCH 0/3] Use pageblock_order for cma and alloc_contig_range alignment.
On 15.11.21 20:37, Zi Yan wrote: > From: Zi Yan > > Hi David, Hi, thanks for looking into this. > > You suggested to make alloc_contig_range() deal with pageblock_order instead > of > MAX_ORDER - 1 and get rid of the MAX_ORDER - 1 dependency in virtio_mem[1]. This > patchset is my attempt to achieve that. Please take a look and let me know if > I am doing it correctly or not. > > From my understanding, cma required an alignment of > max(MAX_ORDER - 1, pageblock_order), because when MIGRATE_CMA was introduced, > __free_one_page() did not prevent merging two different pageblocks, when > MAX_ORDER - 1 > pageblock_order. But the current __free_one_page() implementation > does prevent that. It should be OK to just align cma to pageblock_order. > alloc_contig_range() relies on MIGRATE_CMA to get free pages, so it can use > pageblock_order as alignment too. I wonder if that's sufficient. Especially the outer_start logic in alloc_contig_range() might be problematic. There are some ugly corner cases with free pages/allocations spanning multiple pageblocks and we only isolated a single pageblock. Regarding CMA, we have to keep the following cases working: a) Different pageblock types (MIGRATE_CMA and !MIGRATE_CMA) in a MAX_ORDER - 1 page: [ MAX_ORDER - 1 ] [ pageblock 0 | pageblock 1 ] Assume either pageblock 0 is MIGRATE_CMA or pageblock 1 is MIGRATE_CMA, but not both. We have to make sure alloc_contig_range() keeps working correctly. This should be the case even with your change, as we won't be merging pages across differing migratetypes. b) Migrating/freeing a MAX_ORDER - 1 page while partially isolated: [ MAX_ORDER - 1 ] [ pageblock 0 | pageblock 1 ] Assume both are MIGRATE_CMA. Assume we want to either allocate from pageblock 0 or pageblock 1. Especially, assume we want to allocate from pageblock 1. While we would isolate pageblock 1, we wouldn't isolate pageblock 0.
What happens if we either have a free page spanning the MAX_ORDER - 1 range already OR if we have to migrate a MAX_ORDER - 1 page, resulting in a free MAX_ORDER - 1 page of which only the second pageblock is isolated? We would end up essentially freeing a page that has mixed pageblocks, essentially placing it in !MIGRATE_ISOLATE free lists ... I might be wrong but I have the feeling that this would be problematic. c) Concurrent allocations: [ MAX_ORDER - 1 ] [ pageblock 0 | pageblock 1 ] Assume b) but we have two concurrent CMA allocations to pageblock 0 and pageblock 1, which would now be possible as start_isolate_page_range() isolation would succeed on both. Regarding virtio-mem, we care about the following cases: a) Allocating parts from a completely movable MAX_ORDER - 1 page: [ MAX_ORDER - 1 ] [ pageblock 0 | pageblock 1 ] Assume pageblock 0 and pageblock 1 are either free or contain only movable pages. Assume we allocated pageblock 0. We have to make sure we can allocate pageblock 1. The other way around, assume we allocated pageblock 1, we have to make sure we can allocate pageblock 0. Free pages spanning both pageblocks might be problematic. b) Allocating parts of a partially movable MAX_ORDER - 1 page: [ MAX_ORDER - 1 ] [ pageblock 0 | pageblock 1 ] Assume pageblock 0 contains unmovable data but pageblock 1 does not: we have to make sure we can allocate pageblock 1. Similarly, assume pageblock 1 contains unmovable data but pageblock 0 does not: we have to make sure we can allocate pageblock 0. has_unmovable_pages() might allow for that. But, we want to fail early in case we want to allocate a single pageblock but it contains unmovable data. This could be either directly or indirectly. If we have an unmovable (compound) MAX_ORDER - 1 page and we'd try isolating pageblock 1, has_unmovable_pages() would always return "false" because we'd simply be skipping over any tail pages, and not detect the un-movability.
c) Migrating/freeing a MAX_ORDER - 1 page while partially isolated: Same concern as for CMA b) So the biggest concern I have is dealing with migrating/freeing > pageblock_order pages while only having isolated a single pageblock. > > In terms of virtio_mem, if I understand correctly, it relies on > alloc_contig_range() to obtain contiguous free pages and offlines them to > reduce > guest memory size. As the result of the alloc_contig_range() alignment change, > virtio_mem should be able to just align PFNs to pageblock_order. For virtio-mem it will most probably be desirable to first try allocating the MAX_ORDER - 1 range if possible and then fall back to pageblock_order. But that's an additional change on top in virtio-mem code. My take to teach alloc_contig_range() to properly handle this would be the following: a) Convert MIGRATE_ISOLATE into a separate pageblock flag We would want to convert MIGRATE_ISOLATE into a separate pageblock flag, such that when we isolate a pageblock we preserve the original migratetype. While start_isolate_page_range()
Re: [PATCH RFC] virtio: wrap config->reset calls
On 13.10.21 14:17, Michael S. Tsirkin wrote: > On Wed, Oct 13, 2021 at 01:03:46PM +0200, David Hildenbrand wrote: >> On 13.10.21 12:55, Michael S. Tsirkin wrote: >>> This will enable cleanups down the road. >>> The idea is to disable cbs, then add "flush_queued_cbs" callback >>> as a parameter, this way drivers can flush any work >>> queued after callbacks have been disabled. >>> >>> Signed-off-by: Michael S. Tsirkin >>> --- >>> arch/um/drivers/virt-pci.c | 2 +- >>> drivers/block/virtio_blk.c | 4 ++-- >>> drivers/bluetooth/virtio_bt.c | 2 +- >>> drivers/char/hw_random/virtio-rng.c| 2 +- >>> drivers/char/virtio_console.c | 4 ++-- >>> drivers/crypto/virtio/virtio_crypto_core.c | 8 >>> drivers/firmware/arm_scmi/virtio.c | 2 +- >>> drivers/gpio/gpio-virtio.c | 2 +- >>> drivers/gpu/drm/virtio/virtgpu_kms.c | 2 +- >>> drivers/i2c/busses/i2c-virtio.c| 2 +- >>> drivers/iommu/virtio-iommu.c | 2 +- >>> drivers/net/caif/caif_virtio.c | 2 +- >>> drivers/net/virtio_net.c | 4 ++-- >>> drivers/net/wireless/mac80211_hwsim.c | 2 +- >>> drivers/nvdimm/virtio_pmem.c | 2 +- >>> drivers/rpmsg/virtio_rpmsg_bus.c | 2 +- >>> drivers/scsi/virtio_scsi.c | 2 +- >>> drivers/virtio/virtio.c| 5 + >>> drivers/virtio/virtio_balloon.c| 2 +- >>> drivers/virtio/virtio_input.c | 2 +- >>> drivers/virtio/virtio_mem.c| 2 +- >>> fs/fuse/virtio_fs.c| 4 ++-- >>> include/linux/virtio.h | 1 + >>> net/9p/trans_virtio.c | 2 +- >>> net/vmw_vsock/virtio_transport.c | 4 ++-- >>> sound/virtio/virtio_card.c | 4 ++-- >>> 26 files changed, 39 insertions(+), 33 deletions(-) >>> >>> diff --git a/arch/um/drivers/virt-pci.c b/arch/um/drivers/virt-pci.c >>> index c08066633023..22c4d87c9c15 100644 >>> --- a/arch/um/drivers/virt-pci.c >>> +++ b/arch/um/drivers/virt-pci.c >>> @@ -616,7 +616,7 @@ static void um_pci_virtio_remove(struct virtio_device >>> *vdev) >>> int i; >>> /* Stop all virtqueues */ >>> -vdev->config->reset(vdev); >>> +virtio_reset_device(vdev); >>> vdev->config->del_vqs(vdev); >> >> Nit: 
virtio_device_reset()? >> >> Because I see: >> >> int virtio_device_freeze(struct virtio_device *dev); >> int virtio_device_restore(struct virtio_device *dev); >> void virtio_device_ready(struct virtio_device *dev) >> >> But well, there is: >> void virtio_break_device(struct virtio_device *dev); > > Exactly. I don't know what's best, so I opted for plain english :) Fair enough, LGTM Reviewed-by: David Hildenbrand -- Thanks, David / dhildenb ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH RFC] virtio: wrap config->reset calls
On 13.10.21 12:55, Michael S. Tsirkin wrote: This will enable cleanups down the road. The idea is to disable cbs, then add "flush_queued_cbs" callback as a parameter, this way drivers can flush any work queued after callbacks have been disabled. Signed-off-by: Michael S. Tsirkin --- arch/um/drivers/virt-pci.c | 2 +- drivers/block/virtio_blk.c | 4 ++-- drivers/bluetooth/virtio_bt.c | 2 +- drivers/char/hw_random/virtio-rng.c| 2 +- drivers/char/virtio_console.c | 4 ++-- drivers/crypto/virtio/virtio_crypto_core.c | 8 drivers/firmware/arm_scmi/virtio.c | 2 +- drivers/gpio/gpio-virtio.c | 2 +- drivers/gpu/drm/virtio/virtgpu_kms.c | 2 +- drivers/i2c/busses/i2c-virtio.c| 2 +- drivers/iommu/virtio-iommu.c | 2 +- drivers/net/caif/caif_virtio.c | 2 +- drivers/net/virtio_net.c | 4 ++-- drivers/net/wireless/mac80211_hwsim.c | 2 +- drivers/nvdimm/virtio_pmem.c | 2 +- drivers/rpmsg/virtio_rpmsg_bus.c | 2 +- drivers/scsi/virtio_scsi.c | 2 +- drivers/virtio/virtio.c| 5 + drivers/virtio/virtio_balloon.c| 2 +- drivers/virtio/virtio_input.c | 2 +- drivers/virtio/virtio_mem.c| 2 +- fs/fuse/virtio_fs.c| 4 ++-- include/linux/virtio.h | 1 + net/9p/trans_virtio.c | 2 +- net/vmw_vsock/virtio_transport.c | 4 ++-- sound/virtio/virtio_card.c | 4 ++-- 26 files changed, 39 insertions(+), 33 deletions(-) diff --git a/arch/um/drivers/virt-pci.c b/arch/um/drivers/virt-pci.c index c08066633023..22c4d87c9c15 100644 --- a/arch/um/drivers/virt-pci.c +++ b/arch/um/drivers/virt-pci.c @@ -616,7 +616,7 @@ static void um_pci_virtio_remove(struct virtio_device *vdev) int i; /* Stop all virtqueues */ -vdev->config->reset(vdev); +virtio_reset_device(vdev); vdev->config->del_vqs(vdev); Nit: virtio_device_reset()? 
Because I see: int virtio_device_freeze(struct virtio_device *dev); int virtio_device_restore(struct virtio_device *dev); void virtio_device_ready(struct virtio_device *dev) But well, there is: void virtio_break_device(struct virtio_device *dev); -- Thanks, David / dhildenb
Re: [PATCH 1/2] dma-mapping: remove bogus test for pfn_valid from dma_map_resource
On 30.09.21 03:30, Mike Rapoport wrote: From: Mike Rapoport dma_map_resource() uses pfn_valid() to ensure the range is not RAM. However, pfn_valid() only checks for availability of the memory map for a PFN but it does not ensure that the PFN is actually backed by RAM. As dma_map_resource() is the only method in DMA mapping APIs that has this check, simply drop the pfn_valid() test from dma_map_resource(). Link: https://lore.kernel.org/all/20210824173741.gc...@arm.com/ Signed-off-by: Mike Rapoport --- Acked-by: David Hildenbrand -- Thanks, David / dhildenb
Re: An cma optimization patch is used for cma_[alloc|free].
On 13.08.21 09:00, Jichao Zou wrote: Pre-allocate CMA memory that is configured in the device tree; this greatly improves CMA memory allocation efficiency: cma_[alloc|free] takes less than 1 ms, while the old way took from a few ms to tens or hundreds of ms. Please send patches as proper emails (man git-format-patch; man git-send-email). What you propose is turning CMA reservations into something comparable to permanent boottime allocations. From the POV of the buddy, the pages are always allocated and cannot be repurposed for, e.g., movable allocations until *actually* allocated via CMA. I don't think we want this behavior upstream. -- Thanks, David / dhildenb
Re: [PATCH v3] drivers: introduce and use WANT_DMA_CMA for soft dependencies on DMA_CMA
On 12.04.21 15:12, Robin Murphy wrote: On 2021-04-09 14:39, David Hildenbrand wrote: On 09.04.21 15:35, Arnd Bergmann wrote: On Fri, Apr 9, 2021 at 1:21 PM David Hildenbrand wrote: Random drivers should not override a user configuration of core knobs (e.g., CONFIG_DMA_CMA=n). Applicable drivers would like to use DMA_CMA, which depends on CMA, if possible; however, these drivers also have to tolerate if DMA_CMA is not available/functioning, for example, if no CMA area for DMA_CMA use has been setup via "cma=X". In the worst case, the driver cannot do it's job properly in some configurations. For example, commit 63f5677544b3 ("drm/etnaviv: select CMA and DMA_CMA if available") documents While this is no build dependency, etnaviv will only work correctly on most systems if CMA and DMA_CMA are enabled. Select both options if available to avoid users ending up with a non-working GPU due to a lacking kernel config. So etnaviv really wants to have DMA_CMA, however, can deal with some cases where it is not available. Let's introduce WANT_DMA_CMA and use it in most cases where drivers select CMA/DMA_CMA, or depend on DMA_CMA (in a wrong way via CMA because of recursive dependency issues). We'll assume that any driver that selects DRM_GEM_CMA_HELPER or DRM_KMS_CMA_HELPER would like to use DMA_CMA if possible. With this change, distributions can disable CONFIG_CMA or CONFIG_DMA_CMA, without it silently getting enabled again by random drivers. Also, we'll now automatically try to enabled both, CONFIG_CMA and CONFIG_DMA_CMA if they are unspecified and any driver is around that selects WANT_DMA_CMA -- also implicitly via DRM_GEM_CMA_HELPER or DRM_KMS_CMA_HELPER. For example, if any driver selects WANT_DMA_CMA and we do a "make olddefconfig": 1. With "# CONFIG_CMA is not set" and no specification of "CONFIG_DMA_CMA" -> CONFIG_DMA_CMA won't be part of .config 2. With no specification of CONFIG_CMA or CONFIG_DMA_CMA Contiguous Memory Allocator (CMA) [Y/n/?] 
(NEW) DMA Contiguous Memory Allocator (DMA_CMA) [Y/n/?] (NEW) 3. With "# CONFIG_CMA is not set" and "# CONFIG_DMA_CMA is not set" -> CONFIG_DMA_CMA will be removed from .config Note: drivers/remoteproc seems to be special; commit c51e882cd711 ("remoteproc/davinci: Update Kconfig to depend on DMA_CMA") explains that there is a real dependency to DMA_CMA for it to work; leave that dependency in place and don't convert it to a soft dependency. I don't think this dependency is fundamentally different from the others, though davinci machines tend to have less memory than a lot of the other machines, so it's more likely to fail without CMA. I was also unsure - and Lucas had similar thoughts. If you want, I can send a v4 also taking care of this. TBH I think it should all just be removed. DMA_CMA is effectively an internal feature of the DMA API, and drivers which simply use the DMA API shouldn't really be trying to assume *how* things might be allocated at runtime - CMA is hardly the only way. Platform-level assumptions about the presence or not of IOMMUs, memory carveouts, etc., and whether it even matters - e.g. a device with a tiny LCD may only need display buffers which still fit in a regular MAX_ORDER allocation - could go in platform-specific configs, but I really don't think they belong at the generic subsystem level. We already have various examples like I2S drivers that won't even probe without a dmaengine provider being present, or host controller drivers which are useless without their corresponding phy driver (and I'm guessing you can probably also do higher-level things like include the block layer but omit all filesystem drivers). I don't believe it's Kconfig's job to try to guess whether a given configuration is *useful*, only to enforce that's it's valid to build. That would mean: if it's not a built-time dependency, don't mention it in Kconfig. If that were true, why do we have have defaults modeled in Kconfig then? 
IMHO, some part of Kconfig is to give you sane defaults. -- Thanks, David / dhildenb
Re: [PATCH v3] drivers: introduce and use WANT_DMA_CMA for soft dependencies on DMA_CMA
On 09.04.21 15:35, Arnd Bergmann wrote: On Fri, Apr 9, 2021 at 1:21 PM David Hildenbrand wrote: Random drivers should not override a user configuration of core knobs (e.g., CONFIG_DMA_CMA=n). Applicable drivers would like to use DMA_CMA, which depends on CMA, if possible; however, these drivers also have to tolerate if DMA_CMA is not available/functioning, for example, if no CMA area for DMA_CMA use has been setup via "cma=X". In the worst case, the driver cannot do it's job properly in some configurations. For example, commit 63f5677544b3 ("drm/etnaviv: select CMA and DMA_CMA if available") documents While this is no build dependency, etnaviv will only work correctly on most systems if CMA and DMA_CMA are enabled. Select both options if available to avoid users ending up with a non-working GPU due to a lacking kernel config. So etnaviv really wants to have DMA_CMA, however, can deal with some cases where it is not available. Let's introduce WANT_DMA_CMA and use it in most cases where drivers select CMA/DMA_CMA, or depend on DMA_CMA (in a wrong way via CMA because of recursive dependency issues). We'll assume that any driver that selects DRM_GEM_CMA_HELPER or DRM_KMS_CMA_HELPER would like to use DMA_CMA if possible. With this change, distributions can disable CONFIG_CMA or CONFIG_DMA_CMA, without it silently getting enabled again by random drivers. Also, we'll now automatically try to enabled both, CONFIG_CMA and CONFIG_DMA_CMA if they are unspecified and any driver is around that selects WANT_DMA_CMA -- also implicitly via DRM_GEM_CMA_HELPER or DRM_KMS_CMA_HELPER. For example, if any driver selects WANT_DMA_CMA and we do a "make olddefconfig": 1. With "# CONFIG_CMA is not set" and no specification of "CONFIG_DMA_CMA" -> CONFIG_DMA_CMA won't be part of .config 2. With no specification of CONFIG_CMA or CONFIG_DMA_CMA Contiguous Memory Allocator (CMA) [Y/n/?] (NEW) DMA Contiguous Memory Allocator (DMA_CMA) [Y/n/?] (NEW) 3. 
With "# CONFIG_CMA is not set" and "# CONFIG_DMA_CMA is not set" -> CONFIG_DMA_CMA will be removed from .config Note: drivers/remoteproc seems to be special; commit c51e882cd711 ("remoteproc/davinci: Update Kconfig to depend on DMA_CMA") explains that there is a real dependency to DMA_CMA for it to work; leave that dependency in place and don't convert it to a soft dependency. I don't think this dependency is fundamentally different from the others, though davinci machines tend to have less memory than a lot of the other machines, so it's more likely to fail without CMA. I was also unsure - and Lucas had similar thoughts. If you want, I can send a v4 also taking care of this. Thanks! Regardless of this, Reviewed-by: Arnd Bergmann -- Thanks, David / dhildenb ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH v3] drivers: introduce and use WANT_DMA_CMA for soft dependencies on DMA_CMA
Random drivers should not override a user configuration of core knobs (e.g., CONFIG_DMA_CMA=n). Applicable drivers would like to use DMA_CMA, which depends on CMA, if possible; however, these drivers also have to tolerate if DMA_CMA is not available/functioning, for example, if no CMA area for DMA_CMA use has been set up via "cma=X". In the worst case, the driver cannot do its job properly in some configurations. For example, commit 63f5677544b3 ("drm/etnaviv: select CMA and DMA_CMA if available") documents While this is no build dependency, etnaviv will only work correctly on most systems if CMA and DMA_CMA are enabled. Select both options if available to avoid users ending up with a non-working GPU due to a lacking kernel config. So etnaviv really wants to have DMA_CMA; however, it can deal with some cases where it is not available. Let's introduce WANT_DMA_CMA and use it in most cases where drivers select CMA/DMA_CMA, or depend on DMA_CMA (in a wrong way via CMA because of recursive dependency issues). We'll assume that any driver that selects DRM_GEM_CMA_HELPER or DRM_KMS_CMA_HELPER would like to use DMA_CMA if possible. With this change, distributions can disable CONFIG_CMA or CONFIG_DMA_CMA, without it silently getting enabled again by random drivers. Also, we'll now automatically try to enable both CONFIG_CMA and CONFIG_DMA_CMA if they are unspecified and any driver is around that selects WANT_DMA_CMA -- also implicitly via DRM_GEM_CMA_HELPER or DRM_KMS_CMA_HELPER. For example, if any driver selects WANT_DMA_CMA and we do a "make olddefconfig": 1. With "# CONFIG_CMA is not set" and no specification of "CONFIG_DMA_CMA" -> CONFIG_DMA_CMA won't be part of .config 2. With no specification of CONFIG_CMA or CONFIG_DMA_CMA Contiguous Memory Allocator (CMA) [Y/n/?] (NEW) DMA Contiguous Memory Allocator (DMA_CMA) [Y/n/?] (NEW) 3.
With "# CONFIG_CMA is not set" and "# CONFIG_DMA_CMA is not set" -> CONFIG_DMA_CMA will be removed from .config Note: drivers/remoteproc seems to be special; commit c51e882cd711 ("remoteproc/davinci: Update Kconfig to depend on DMA_CMA") explains that there is a real dependency to DMA_CMA for it to work; leave that dependency in place and don't convert it to a soft dependency. Cc: Maarten Lankhorst Cc: Maxime Ripard Cc: Thomas Zimmermann Cc: David Airlie Cc: Daniel Vetter Cc: Joel Stanley Cc: Andrew Jeffery Cc: Lucas Stach Cc: Russell King Cc: Christian Gmeiner Cc: Paul Cercueil Cc: Linus Walleij Cc: Christoph Hellwig Cc: Marek Szyprowski Cc: Robin Murphy Cc: Andrew Morton Cc: Mike Rapoport Cc: Arnd Bergmann Cc: Bartlomiej Zolnierkiewicz Cc: Eric Anholt Cc: Michal Simek Cc: Masahiro Yamada Cc: "Alexander A. Klimov" Cc: Peter Collingbourne Cc: Suman Anna Cc: Jason Gunthorpe Cc: dri-de...@lists.freedesktop.org Cc: linux-asp...@lists.ozlabs.org Cc: linux-arm-ker...@lists.infradead.org Cc: etna...@lists.freedesktop.org Cc: linux-m...@vger.kernel.org Cc: linux-fb...@vger.kernel.org Cc: iommu@lists.linux-foundation.org Signed-off-by: David Hildenbrand --- Let's see if this approach is better for soft dependencies (and if we actually have some hard dependencies in there). This is the follow-up of https://lkml.kernel.org/r/20210408092011.52763-1-da...@redhat.com https://lkml.kernel.org/r/20210408100523.63356-1-da...@redhat.com I was wondering if it would make sense in some drivers to warn if either CONFIG_DMA_CMA is not available or if DRM_CMA has not been configured properly - just to give people a heads up that something might more likely go wrong; that would, however, be future work. v2 -> v3: - Don't use "imply" but instead use a new WANT_DMA_CMA and make the default of CMA and DMA_CMA depend on it. 
- Also adjust ingenic, mcde, tve200; these sound like soft dependencies as well (although DMA_CMA is really desired) v1 -> v2: - Fix DRM_CMA -> DMA_CMA --- drivers/gpu/drm/Kconfig | 2 ++ drivers/gpu/drm/aspeed/Kconfig | 2 -- drivers/gpu/drm/etnaviv/Kconfig | 3 +-- drivers/gpu/drm/ingenic/Kconfig | 1 - drivers/gpu/drm/mcde/Kconfig| 1 - drivers/gpu/drm/tve200/Kconfig | 1 - drivers/video/fbdev/Kconfig | 2 +- kernel/dma/Kconfig | 7 +++ mm/Kconfig | 1 + 9 files changed, 12 insertions(+), 8 deletions(-) diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig index 85b79a7fee63..6f9989adfa93 100644 --- a/drivers/gpu/drm/Kconfig +++ b/drivers/gpu/drm/Kconfig @@ -201,12 +201,14 @@ config DRM_TTM_HELPER config DRM_GEM_CMA_HELPER bool depends on DRM + select WANT_DMA_CMA help Choose this if you need the GEM CMA helper functions config DRM_KMS_CMA_HELPER bool depends on DRM + select WANT_DMA_CMA select DRM_GEM_CMA_HELPER help Choose this if
Re: [PATCH 0/3] mm/page_alloc: Fix pageblock_order with HUGETLB_PAGE_SIZE_VARIABLE
But again, if there are valid use cases then sure, let's make the code fully compatible with HUGETLB_PAGE_ORDER > MAX_ORDER. Given that gigantic HugeTLB allocation can fall back on alloc_contig_pages() or CMA if/when available, is there a real need for HUGETLB_PAGE_ORDER to be up to MAX_ORDER, used as pageblock_order or as COMPACTION_HPAGE_ORDER? With gigantic HugeTLB pages being available, HUGETLB_PAGE_ORDER seems to be just detached from the buddy allocator. But I am not sure, will keep looking. Having HPAGE_PMD_ORDER >= MAX_ORDER ("significantly larger") will make it very unlikely that you are able to reliably allocate any huge pages at all dynamically at runtime without CMA. Gigantic pages are problematic by nature :) -- Thanks, David / dhildenb
Re: [PATCH 0/3] mm/page_alloc: Fix pageblock_order with HUGETLB_PAGE_SIZE_VARIABLE
On 12.02.21 08:02, Anshuman Khandual wrote: On 2/11/21 2:07 PM, David Hildenbrand wrote: On 11.02.21 07:22, Anshuman Khandual wrote: The following warning gets triggered while trying to boot a 64K page size without THP config kernel on arm64 platform. WARNING: CPU: 5 PID: 124 at mm/vmstat.c:1080 __fragmentation_index+0xa4/0xc0 Modules linked in: CPU: 5 PID: 124 Comm: kswapd0 Not tainted 5.11.0-rc6-4-ga0ea7d62002 #159 Hardware name: linux,dummy-virt (DT) [ 8.810673] pstate: 2045 (nzCv daif +PAN -UAO -TCO BTYPE=--) [ 8.811732] pc : __fragmentation_index+0xa4/0xc0 [ 8.812555] lr : fragmentation_index+0xf8/0x138 [ 8.813360] sp : 864079b0 [ 8.813958] x29: 864079b0 x28: 0372 [ 8.814901] x27: 7682 x26: 8000135b3948 [ 8.815847] x25: 1fffe00010c80f48 x24: [ 8.816805] x23: x22: 000d [ 8.817764] x21: 0030 x20: 0005ffcb4d58 [ 8.818712] x19: 000b x18: [ 8.819656] x17: x16: [ 8.820613] x15: x14: 8000114c6258 [ 8.821560] x13: 6000bff969ba x12: 1fffe000bff969b9 [ 8.822514] x11: 1fffe000bff969b9 x10: 6000bff969b9 [ 8.823461] x9 : dfff8000 x8 : 0005ffcb4dcf [ 8.824415] x7 : 0001 x6 : 41b58ab3 [ 8.825359] x5 : 600010c80f48 x4 : dfff8000 [ 8.826313] x3 : 8000102be670 x2 : 0007 [ 8.827259] x1 : 86407a60 x0 : 000d [ 8.828218] Call trace: [ 8.828667] __fragmentation_index+0xa4/0xc0 [ 8.829436] fragmentation_index+0xf8/0x138 [ 8.830194] compaction_suitable+0x98/0xb8 [ 8.830934] wakeup_kcompactd+0xdc/0x128 [ 8.831640] balance_pgdat+0x71c/0x7a0 [ 8.832327] kswapd+0x31c/0x520 [ 8.832902] kthread+0x224/0x230 [ 8.833491] ret_from_fork+0x10/0x30 [ 8.834150] ---[ end trace 472836f79c15516b ]--- This warning comes from __fragmentation_index() when the requested order is greater than MAX_ORDER. 
static int __fragmentation_index(unsigned int order, struct contig_page_info *info) { unsigned long requested = 1UL << order; if (WARN_ON_ONCE(order >= MAX_ORDER)) <= Triggered here return 0; Digging it further reveals that pageblock_order has been assigned a value which is greater than MAX_ORDER failing the above check. But why this happened ? Because HUGETLB_PAGE_ORDER for the given config on arm64 is greater than MAX_ORDER. The solution involves enabling HUGETLB_PAGE_SIZE_VARIABLE which would make pageblock_order a variable instead of constant HUGETLB_PAGE_ORDER. But that change alone also did not really work as pageblock_order still got assigned as HUGETLB_PAGE_ORDER in set_pageblock_order(). HUGETLB_PAGE_ORDER needs to be less than MAX_ORDER for its appropriateness as pageblock_order otherwise just fallback to MAX_ORDER - 1 as before. While here it also fixes a build problem via type casting MAX_ORDER in rmem_cma_setup(). I'm wondering, is there any real value in allowing FORCE_MAX_ZONEORDER to be "11" with ARM64_64K_PAGES/ARM64_16K_PAGES? MAX_ORDER should be as high as would be required for the current config. Unless THP is enabled, there is no need for it to be any higher than 11. But I might be missing historical reasons around this as well. Probably others from arm64 could help here. Theoretically yes, practically no. If nobody cares about a configuration, no need to make the code more complicated for that configuration. Meaning: are there any real use cases that actually build a kernel without TRANSPARENT_HUGEPAGE and with ARM64_64K_PAGES/ARM64_16K_PAGES? THP is always optional. Besides kernel builds without THP should always be supported. Assuming that all builds will have THP enabled, might not be accurate. As builds are essentially broken, I assume this is not that relevant? Or how long has it been broken? Git blame shows that it's been there for some time now. But how does that make this irrelevant ? A problem should be fixed nonetheless. 
When exactly did I say not to fix it? I'm saying if nobody uses it, we might be able to simplify. It might be easier to just drop the "TRANSPARENT_HUGEPAGE" part from the FORCE_MAX_ZONEORDER config. Not sure if it would be a good idea to unnecessarily have a larger MAX_ORDER value for a given config. But I might be missing other contexts here. My point is: keep it simple if there is no need to make it complicated. If these arm64 variants are the only cases where we run into that issue and nobody uses them ("that it's been there for some time now"), why make stuff complicated? The current code seems to assume that HUGETLB_PAGE_ORDER <= MAX_ORDER. Instead of changing that for optimizing an unused use case (it is broken), just simplify the arm64 conditions.
Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin
Again in proper SVA it should be quite unlikely to take a fault caused by something like migration, on the same likelihood as the CPU. If things are faulting so much this is a problem then I think it is a system level problem with doing too much page motion. My point is that a single SVA application shouldn't require the system to make global changes, such as disabling NUMA balancing or disabling THP, to decrease page fault frequency by affecting other applications. Anyway, guys are in lunar new year. Hopefully, we are getting more real benchmark data afterwards to make the discussion more targeted. Right, but I think functionality as proposed in this patch is highly unlikely to make it into the kernel. I'd be interested in approaches to mitigate this per process. E.g., temporarily stop the kernel from messing with THP of this specific process. But even then, why should some random library make such decisions for a whole process? Just as, why should some random library pin pages never allocated by it and stop THP from being created or NUMA layout from getting optimized? This looks like a clear layer violation to me. I fully agree with Jason: Why do the events happen that often such that your use cases are affected that heavily, such that we even need such ugly handling? What mempinfd does is exposing dangerous functionality that we don't want 99.6% of all user space to ever use via a syscall to generic users to fix broken* hw. *broken might be over-stressing the situation, but the HW (SVA) obviously seems to perform way worse than ordinary CPUs. -- Thanks, David / dhildenb
Re: [PATCH 0/3] mm/page_alloc: Fix pageblock_order with HUGETLB_PAGE_SIZE_VARIABLE
On 11.02.21 07:22, Anshuman Khandual wrote: The following warning gets triggered while trying to boot a 64K page size without THP config kernel on arm64 platform. WARNING: CPU: 5 PID: 124 at mm/vmstat.c:1080 __fragmentation_index+0xa4/0xc0 Modules linked in: CPU: 5 PID: 124 Comm: kswapd0 Not tainted 5.11.0-rc6-4-ga0ea7d62002 #159 Hardware name: linux,dummy-virt (DT) [8.810673] pstate: 2045 (nzCv daif +PAN -UAO -TCO BTYPE=--) [8.811732] pc : __fragmentation_index+0xa4/0xc0 [8.812555] lr : fragmentation_index+0xf8/0x138 [8.813360] sp : 864079b0 [8.813958] x29: 864079b0 x28: 0372 [8.814901] x27: 7682 x26: 8000135b3948 [8.815847] x25: 1fffe00010c80f48 x24: [8.816805] x23: x22: 000d [8.817764] x21: 0030 x20: 0005ffcb4d58 [8.818712] x19: 000b x18: [8.819656] x17: x16: [8.820613] x15: x14: 8000114c6258 [8.821560] x13: 6000bff969ba x12: 1fffe000bff969b9 [8.822514] x11: 1fffe000bff969b9 x10: 6000bff969b9 [8.823461] x9 : dfff8000 x8 : 0005ffcb4dcf [8.824415] x7 : 0001 x6 : 41b58ab3 [8.825359] x5 : 600010c80f48 x4 : dfff8000 [8.826313] x3 : 8000102be670 x2 : 0007 [8.827259] x1 : 86407a60 x0 : 000d [8.828218] Call trace: [8.828667] __fragmentation_index+0xa4/0xc0 [8.829436] fragmentation_index+0xf8/0x138 [8.830194] compaction_suitable+0x98/0xb8 [8.830934] wakeup_kcompactd+0xdc/0x128 [8.831640] balance_pgdat+0x71c/0x7a0 [8.832327] kswapd+0x31c/0x520 [8.832902] kthread+0x224/0x230 [8.833491] ret_from_fork+0x10/0x30 [8.834150] ---[ end trace 472836f79c15516b ]--- This warning comes from __fragmentation_index() when the requested order is greater than MAX_ORDER. static int __fragmentation_index(unsigned int order, struct contig_page_info *info) { unsigned long requested = 1UL << order; if (WARN_ON_ONCE(order >= MAX_ORDER)) <= Triggered here return 0; Digging it further reveals that pageblock_order has been assigned a value which is greater than MAX_ORDER failing the above check. But why this happened ? 
Because HUGETLB_PAGE_ORDER for the given config on arm64 is greater than MAX_ORDER. The solution involves enabling HUGETLB_PAGE_SIZE_VARIABLE, which makes pageblock_order a variable instead of the compile-time constant HUGETLB_PAGE_ORDER. But that change alone did not really work, as pageblock_order still got assigned HUGETLB_PAGE_ORDER in set_pageblock_order(). HUGETLB_PAGE_ORDER needs to be less than MAX_ORDER to be usable as pageblock_order; otherwise, just fall back to MAX_ORDER - 1 as before. While here, it also fixes a build problem by type-casting MAX_ORDER in rmem_cma_setup().

I'm wondering, is there any real value in allowing FORCE_MAX_ZONEORDER to be "11" with ARM64_64K_PAGES/ARM64_16K_PAGES? Meaning: are there any real use cases that actually build a kernel without TRANSPARENT_HUGEPAGE and with ARM64_64K_PAGES/ARM64_16K_PAGES? As such builds are essentially broken, I assume this is not that relevant? Or how long has it been broken?

It might be easier to just drop the "TRANSPARENT_HUGEPAGE" part from the FORCE_MAX_ZONEORDER config.

--
Thanks,

David / dhildenb
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin
On 08.02.21 11:13, Song Bao Hua (Barry Song) wrote:

From: David Hildenbrand
Sent: Monday, February 8, 2021 9:22 PM
Subject: Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin

On 08.02.21 03:27, Song Bao Hua (Barry Song) wrote:

From: Matthew Wilcox
Sent: Monday, February 8, 2021 2:31 PM
Subject: Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin

On Sun, Feb 07, 2021 at 10:24:28PM +, Song Bao Hua (Barry Song) wrote:

In high-performance I/O cases, accelerators might want to perform I/O on a memory without IO page faults, which can result in dramatically increased latency. Current memory related APIs could not achieve this requirement, e.g. mlock can only avoid memory being swapped to a backup device; page migration can still trigger IO page faults.

Well ... we have two requirements. The application wants to not take page faults. The system wants to move the application to a different NUMA node in order to optimise overall performance.
Why should the application's desires take precedence over the kernel's desires? And why should it be done this way rather than by the sysadmin using numactl to lock the application to a particular node?

The NUMA balancer is just one of many reasons for page migration. Even one simple alloc_pages() can cause memory migration within a single NUMA node or on a UMA system. The other reasons for page migration include, but are not limited to:
* memory move due to CMA
* memory move due to huge page creation
We can hardly ask users to disable COMPACTION, CMA and huge pages in the whole system.

You're dodging the question. Should the CMA allocation fail because another application is using SVA? I would say no.

I would say no as well. While IOMMU is enabled, CMA almost has only one user: the IOMMU driver, as other drivers will depend on iommu to use non-contiguous memory even though they are still calling dma_alloc_coherent(). In the iommu driver, dma_alloc_coherent is called during initialization and there is no new allocation afterwards, so it wouldn't cause a runtime impact on SVA performance. Even if there are new allocations, CMA will fall back to the general alloc_pages(), and iommu drivers mostly allocate small memory for command queues. So I would say general compound pages, huge pages, and especially transparent huge pages, would be bigger concerns than CMA for internal page migration within one NUMA node. Unlike CMA, the general alloc_pages() can get memory by moving pages other than those pinned. And there is no guarantee we can always bind the memory of SVA applications to a single NUMA node, so NUMA balancing is still a concern. But I agree we need a way to make CMA succeed while the userspace pages are pinned. Since pinning has become pervasive in many drivers, I assume there is a way to handle this.
Otherwise, APIs like V4L2_MEMORY_USERPTR[1] will possibly make CMA fail, as there is no guarantee that userspace will allocate unmovable memory and no guarantee the fallback path - alloc_pages() - can succeed while allocating big memory.

Long term pinnings cannot go onto CMA-reserved memory, and there is similar work to also fix ZONE_MOVABLE in that regard. https://lkml.kernel.org/r/20210125194751.1275316-1-pasha.tatashin@soleen.com

One of the reasons I detest using long term pinning of pages where it could be avoided. Take VFIO and RDMA as an example: these things currently can't work without them.

What I read here: "DMA performance will be affected severely". That does not sound like a compelling argument to me for long term pinnings. Please find another way to achieve the same goal without long term pinnings controlled by user space - e.g., controlling when migration actually happens. For example, CMA/alloc_contig_range()/memory unplug are corner cases that happen rarely; you shouldn't have to worry about them messing with your DMA performance.

I agree CMA/alloc_contig_range()/memory unplug would be corner cases,
Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin
On 08.02.21 03:27, Song Bao Hua (Barry Song) wrote:

From: Matthew Wilcox
Sent: Monday, February 8, 2021 2:31 PM
Subject: Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin

On Sun, Feb 07, 2021 at 10:24:28PM +, Song Bao Hua (Barry Song) wrote:

In high-performance I/O cases, accelerators might want to perform I/O on a memory without IO page faults, which can result in dramatically increased latency. Current memory related APIs could not achieve this requirement, e.g. mlock can only avoid memory being swapped to a backup device; page migration can still trigger IO page faults.

Well ... we have two requirements. The application wants to not take page faults. The system wants to move the application to a different NUMA node in order to optimise overall performance.

Why should the application's desires take precedence over the kernel's desires? And why should it be done this way rather than by the sysadmin using numactl to lock the application to a particular node?

The NUMA balancer is just one of many reasons for page migration. Even one simple alloc_pages() can cause memory migration within a single NUMA node or on a UMA system. The other reasons for page migration include, but are not limited to:
* memory move due to CMA
* memory move due to huge page creation
We can hardly ask users to disable COMPACTION, CMA and huge pages in the whole system.

You're dodging the question. Should the CMA allocation fail because another application is using SVA? I would say no.
I would say no as well. While IOMMU is enabled, CMA almost has only one user: the IOMMU driver, as other drivers will depend on iommu to use non-contiguous memory even though they are still calling dma_alloc_coherent(). In the iommu driver, dma_alloc_coherent is called during initialization and there is no new allocation afterwards, so it wouldn't cause a runtime impact on SVA performance. Even if there are new allocations, CMA will fall back to the general alloc_pages(), and iommu drivers mostly allocate small memory for command queues. So I would say general compound pages, huge pages, and especially transparent huge pages, would be bigger concerns than CMA for internal page migration within one NUMA node. Unlike CMA, the general alloc_pages() can get memory by moving pages other than those pinned. And there is no guarantee we can always bind the memory of SVA applications to a single NUMA node, so NUMA balancing is still a concern. But I agree we need a way to make CMA succeed while the userspace pages are pinned. Since pinning has become pervasive in many drivers, I assume there is a way to handle this. Otherwise, APIs like V4L2_MEMORY_USERPTR[1] will possibly make CMA fail, as there is no guarantee that userspace will allocate unmovable memory and no guarantee the fallback path - alloc_pages() - can succeed while allocating big memory.

Long term pinnings cannot go onto CMA-reserved memory, and there is similar work to also fix ZONE_MOVABLE in that regard. https://lkml.kernel.org/r/20210125194751.1275316-1-pasha.tatas...@soleen.com

One of the reasons I detest using long term pinning of pages where it could be avoided. Take VFIO and RDMA as an example: these things currently can't work without them.

What I read here: "DMA performance will be affected severely". That does not sound like a compelling argument to me for long term pinnings. Please find another way to achieve the same goal without long term pinnings controlled by user space - e.g., controlling when migration actually happens.
For example, CMA/alloc_contig_range()/memory unplug are corner cases that happen rarely, you shouldn't have to worry about them messing with your DMA performance.

--
Thanks,

David / dhildenb
Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin
On 07.02.21 09:18, Zhou Wang wrote:

SVA (shared virtual address) offers a way for a device to share a process's virtual address space safely, which makes user-space device driver coding more convenient. However, IO page faults may happen when doing DMA operations. As the latency of an IO page fault is relatively big, DMA performance will be affected severely when there are IO page faults. From a long-term view, DMA performance will not be stable.

In high-performance I/O cases, accelerators might want to perform I/O on a memory without IO page faults, which can result in dramatically increased latency. Current memory related APIs could not achieve this requirement, e.g. mlock can only avoid memory being swapped to a backup device; page migration can still trigger IO page faults.

Various drivers working under traditional non-SVA mode are using their own specific ioctls to do pinning. Such ioctls can be seen in v4l2, gpu, infiniband, media, vfio, etc. Drivers usually do the dma mapping while doing the pin. But in SVA mode, pinning could be a common need which isn't necessarily bound to any driver, and neither is dma mapping needed by drivers, since devices are using the virtual address of the CPU. Thus, it is better to introduce a new common syscall for it.

This patch leverages the design of userfaultfd and adds mempinfd for pinning to avoid messing up mm_struct. An fd is obtained via mempinfd, then user space can pin/unpin pages via ioctls on this fd; all pinned pages under one file will be unpinned in the file release process. Like pinning in other places, can_do_mlock is used to check permission and input parameters.
Signed-off-by: Zhou Wang
Signed-off-by: Sihang Chen
Suggested-by: Barry Song
---
 arch/arm64/include/asm/unistd.h   |   2 +-
 arch/arm64/include/asm/unistd32.h |   2 +
 fs/Makefile                       |   1 +
 fs/mempinfd.c                     | 199 ++
 include/linux/syscalls.h          |   1 +
 include/uapi/asm-generic/unistd.h |   4 +-
 include/uapi/linux/mempinfd.h     |  23 +
 init/Kconfig                      |   6 ++
 8 files changed, 236 insertions(+), 2 deletions(-)
 create mode 100644 fs/mempinfd.c
 create mode 100644 include/uapi/linux/mempinfd.h

diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
index 86a9d7b3..949788f 100644
--- a/arch/arm64/include/asm/unistd.h
+++ b/arch/arm64/include/asm/unistd.h
@@ -38,7 +38,7 @@
 #define __ARM_NR_compat_set_tls	(__ARM_NR_COMPAT_BASE + 5)
 #define __ARM_NR_COMPAT_END	(__ARM_NR_COMPAT_BASE + 0x800)
-#define __NR_compat_syscalls	442
+#define __NR_compat_syscalls	443
 #endif
 #define __ARCH_WANT_SYS_CLONE

diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index cccfbbe..3f49529 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -891,6 +891,8 @@ __SYSCALL(__NR_faccessat2, sys_faccessat2)
 __SYSCALL(__NR_process_madvise, sys_process_madvise)
 #define __NR_epoll_pwait2 441
 __SYSCALL(__NR_epoll_pwait2, compat_sys_epoll_pwait2)
+#define __NR_mempinfd 442
+__SYSCALL(__NR_mempinfd, sys_mempinfd)
 /*
  * Please add new compat syscalls above this comment and update

diff --git a/fs/Makefile b/fs/Makefile
index 999d1a2..e1cbf12 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -54,6 +54,7 @@
 obj-$(CONFIG_COREDUMP)	+= coredump.o
 obj-$(CONFIG_SYSCTL)	+= drop_caches.o
 obj-$(CONFIG_FHANDLE)	+= fhandle.o
+obj-$(CONFIG_MEMPINFD)	+= mempinfd.o
 obj-y			+= iomap/
 obj-y			+= quota/

diff --git a/fs/mempinfd.c b/fs/mempinfd.c
new file mode 100644
index 000..23d3911
--- /dev/null
+++ b/fs/mempinfd.c
@@ -0,0 +1,199 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (c) 2021 HiSilicon Limited.
+ */
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+struct mem_pin_container {
+	struct xarray array;
+	struct mutex lock;
+};
+
+struct pin_pages {
+	unsigned long first;
+	unsigned long nr_pages;
+	struct page **pages;
+};
+
+static int mempinfd_release(struct inode *inode, struct file *file)
+{
+	struct mem_pin_container *priv = file->private_data;
+	struct pin_pages *p;
+	unsigned long idx;
+
+	xa_for_each(&priv->array, idx, p) {
+		unpin_user_pages(p->pages, p->nr_pages);
+		xa_erase(&priv->array, p->first);
+		vfree(p->pages);
+		kfree(p);
+	}
+
+	mutex_destroy(&priv->lock);
+	xa_destroy(&priv->array);
+	kfree(priv);
+
+	return 0;
+}
+
+static int mempinfd_input_check(u64 addr, u64 size)
+{
+	if (!size || addr + size < addr)
+		return -EINVAL;
+
+	if (!can_do_mlock())
+		return -EPERM;
+
+	return 0;
+}
+
+static int mem_pin_page(struct mem_pin_container *priv,
+			struct
Re: [RFC 2/3] arm64/hugetlb: Enable HUGETLB_PAGE_SIZE_VARIABLE
On 04.02.21 08:01, Anshuman Khandual wrote:

MAX_ORDER, which invariably depends on FORCE_MAX_ZONEORDER, can vary for a given page size, depending on whether TRANSPARENT_HUGEPAGE is enabled or not. In certain page size and THP combinations, HUGETLB_PAGE_ORDER can be greater than MAX_ORDER, making it unusable as pageblock_order.

Just so I understand correctly, this does not imply that we have THPs that exceed the pageblock size / MAX_ORDER size, correct?

This enables HUGETLB_PAGE_SIZE_VARIABLE, making pageblock_order a variable rather than the compile time constant HUGETLB_PAGE_ORDER, which could break the MAX_ORDER rule for certain configurations.

Cc: Catalin Marinas
Cc: Will Deacon
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-ker...@vger.kernel.org
Signed-off-by: Anshuman Khandual
---
 arch/arm64/Kconfig | 4
 1 file changed, 4 insertions(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 175914f2f340..c4acf8230f20 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -1918,6 +1918,10 @@ config ARCH_ENABLE_THP_MIGRATION
 	def_bool y
 	depends on TRANSPARENT_HUGEPAGE

+config HUGETLB_PAGE_SIZE_VARIABLE
+	def_bool y
+	depends on HUGETLB_PAGE
+
 menu "Power management options"
 source "kernel/power/Kconfig"

--
Thanks,

David / dhildenb
Re: [PATCH v2] iommu/mediatek: check 4GB mode by reading infracfg
On 21.07.20 04:16, Miles Chen wrote:
> In previous discussion [1] and [2], we found that it is risky to
> use max_pfn or totalram_pages to tell if 4GB mode is enabled.
>
> Check 4GB mode by reading the infracfg register, and remove the usage
> of the un-exported symbol max_pfn.
>
> This is a step towards building mtk_iommu as a kernel module.
>
> Change since v1:
> 1. remove the phandle usage, search for infracfg instead [3]
> 2. use infracfg instead of infracfg_regmap
> 3. move infracfg definitions to linux/soc/mediatek/infracfg.h
> 4. update enable_4GB only when has_4gb_mode

Nit: We tend to place such information below the "---" (adding a second one) such that this information is discarded when the patch is picked up.

> [1] https://lkml.org/lkml/2020/6/3/733
> [2] https://lkml.org/lkml/2020/6/4/136
> [3] https://lkml.org/lkml/2020/7/15/1147
>
> Cc: Mike Rapoport
> Cc: David Hildenbrand
> Cc: Yong Wu
> Cc: Yingjoe Chen
> Cc: Christoph Hellwig
> Cc: Yong Wu
> Cc: Chao Hao
> Cc: Rob Herring
> Cc: Matthias Brugger
> Signed-off-by: Miles Chen
> ---
> drivers/iommu/mtk_iommu.c | 26 +-
> include/linux/soc/mediatek/infracfg.h | 3 +++
> 2 files changed, 24 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/iommu/mtk_iommu.c b/drivers/iommu/mtk_iommu.c
> index 2be96f1cdbd2..16765f532853 100644
> --- a/drivers/iommu/mtk_iommu.c
> +++ b/drivers/iommu/mtk_iommu.c
> @@ -3,7 +3,6 @@
> * Copyright (c) 2015-2016 MediaTek Inc.
> * Author: Yong Wu
> */
> -#include
> #include
> #include
> #include
> @@ -15,13 +14,16 @@
> #include
> #include
> #include
> +#include
> #include
> #include
> #include
> #include
> #include
> +#include
> #include
> #include
> +#include
> #include
> #include
>
> @@ -599,8 +601,10 @@ static int mtk_iommu_probe(struct platform_device *pdev)
> 	struct resource *res;
> 	resource_size_t ioaddr;
> 	struct component_match *match = NULL;
> +	struct regmap *infracfg;
> 	void *protect;
> 	int i, larb_nr, ret;
> +	u32 val;
>
> 	data = devm_kzalloc(dev, sizeof(*data), GFP_KERNEL);
> 	if (!data)
> @@ -614,10 +618,22 @@ static int mtk_iommu_probe(struct platform_device *pdev)
> 		return -ENOMEM;
> 	data->protect_base = ALIGN(virt_to_phys(protect), MTK_PROTECT_PA_ALIGN);
>
> -	/* Whether the current dram is over 4GB */
> -	data->enable_4GB = !!(max_pfn > (BIT_ULL(32) >> PAGE_SHIFT));
> -	if (!data->plat_data->has_4gb_mode)
> -		data->enable_4GB = false;
> +	data->enable_4GB = false;
> +	if (data->plat_data->has_4gb_mode) {
> +		infracfg = syscon_regmap_lookup_by_compatible(
> +				"mediatek,mt8173-infracfg");
> +		if (IS_ERR(infracfg)) {
> +			infracfg = syscon_regmap_lookup_by_compatible(
> +					"mediatek,mt2712-infracfg");
> +			if (IS_ERR(infracfg))
> +				return PTR_ERR(infracfg);
> +
> +		}
> +		ret = regmap_read(infracfg, REG_INFRA_MISC, &val);
> +		if (ret)
> +			return ret;
> +		data->enable_4GB = !!(val & F_DDR_4GB_SUPPORT_EN);

(I am absolutely not familiar with syscon_regmap_lookup_by ..., I am missing some context, so sorry for the stupid questions.)

Who sets the regmap value and based on what? Who decides that 4GB mode is supported or not? And who decides if 4GB mode is *required* or not?

--
Thanks,

David / dhildenb
Re: [PATCH] iommu/mediatek: Use totalram_pages to setup enable_4GB
On 04.06.20 17:27, David Hildenbrand wrote:
> On 04.06.20 17:06, Christoph Hellwig wrote:
>> On Thu, Jun 04, 2020 at 01:32:40PM +0200, David Hildenbrand wrote:
>>> Just a thought: If memory hotplug is applicable as well, you might
>>> either want to always assume data->enable_4GB, or handle memory hotplug
>>> events from the memory notifier, when new memory gets onlined (not sure
>>> how tricky that is).
>>
>> We probably want a highest_pfn_possible() or similar API instead of
>> having drivers poking into random VM internals.
>
> Well, memory notifiers are a reasonable API used across the kernel to
> get notified when new memory is onlined to the buddy that could be used
> for allocations.
>
> highest_pfn_possible() would have to default to something linked to
> MAX_PHYSMEM_BITS whenever memory hotplug is configured, I am not sure
> how helpful that is (IOW, you can just default to enable_4GB=true in
> that case instead in most cases).

Correction: At least on x86-64 we have max_possible_pfn, which will consult the ACPI SRAT table to figure out the maximum possible PFN. (Without SRAT, max_possible_pfn will point at the end of initial boot memory and not consider hotplug memory - something that e.g., newer QEMU versions work around by creating SRAT tables if memory hotplug might be possible, even if there is no actual NUMA configuration.)

pci-swiotlb.c similarly relies on that to figure out if there are any !DMA addresses to handle.

--
Thanks,

David / dhildenb
Re: [PATCH] iommu/mediatek: Use totalram_pages to setup enable_4GB
On 04.06.20 17:06, Christoph Hellwig wrote:
> On Thu, Jun 04, 2020 at 01:32:40PM +0200, David Hildenbrand wrote:
>> Just a thought: If memory hotplug is applicable as well, you might
>> either want to always assume data->enable_4GB, or handle memory hotplug
>> events from the memory notifier, when new memory gets onlined (not sure
>> how tricky that is).
>
> We probably want a highest_pfn_possible() or similar API instead of
> having drivers poking into random VM internals.

Well, memory notifiers are a reasonable API used across the kernel to get notified when new memory is onlined to the buddy that could be used for allocations.

highest_pfn_possible() would have to default to something linked to MAX_PHYSMEM_BITS whenever memory hotplug is configured; I am not sure how helpful that is (IOW, you can just default to enable_4GB=true in that case in most cases).

--
Thanks,

David / dhildenb
Re: [PATCH] iommu/mediatek: Use totalram_pages to setup enable_4GB
On 04.06.20 11:49, Miles Chen wrote:
> On Thu, 2020-06-04 at 10:25 +0200, David Hildenbrand wrote:
>> On 04.06.20 10:01, Miles Chen wrote:
>>> To build this driver as a kernel module, we cannot use
>>> the unexported symbol "max_pfn" to setup enable_4GB.
>>>
>>> Use totalram_pages() instead to setup enable_4GB.
>>>
>>> Suggested-by: Mike Rapoport
>>> Signed-off-by: Miles Chen
>>> Cc: David Hildenbrand
>>> Cc: Yong Wu
>>> Cc: Chao Hao
>>> ---
>>> drivers/iommu/mtk_iommu.c | 5 ++---
>>> 1 file changed, 2 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/drivers/iommu/mtk_iommu.c b/drivers/iommu/mtk_iommu.c
>>> index 5f4d6df59cf6..c2798a6e0e38 100644
>>> --- a/drivers/iommu/mtk_iommu.c
>>> +++ b/drivers/iommu/mtk_iommu.c
>>> @@ -3,7 +3,6 @@
>>> * Copyright (c) 2015-2016 MediaTek Inc.
>>> * Author: Yong Wu
>>> */
>>> -#include
>>> #include
>>> #include
>>> #include
>>> @@ -626,8 +625,8 @@ static int mtk_iommu_probe(struct platform_device *pdev)
>>> 	return -ENOMEM;
>>> 	data->protect_base = ALIGN(virt_to_phys(protect), MTK_PROTECT_PA_ALIGN);
>>>
>>> -	/* Whether the current dram is over 4GB */
>>> -	data->enable_4GB = !!(max_pfn > (BIT_ULL(32) >> PAGE_SHIFT));
>>> +	/* Whether the current dram is over 4GB, note: DRAM start at 1GB */
>>> +	data->enable_4GB = !!(totalram_pages() > ((SZ_2G + SZ_1G) >>
>>> 				PAGE_SHIFT));
>>
>> A similar thing seems to be done by
>> drivers/media/platform/mtk-vpu/mtk_vpu.c:
>> vpu->enable_4GB = !!(totalram_pages() > (SZ_2G >> PAGE_SHIFT));
>>
>> I do wonder if some weird memory hotplug setups might give you false
>> negatives.
>>
>> E.g., start a VM with 1GB and hotplug 1GB - it will be hotplugged on
>> x86-64 above 4GB, turning max_pfn into 5GB. totalram_pages() should
>> return something < 2GB.
>>
>> Same can happen when you have a VM and use ballooning to fake-unplug
>> memory, making totalram_pages() return something < 4GB, but leaving
>> usable pfns >= 4GB
>
> Yes. Yingjoe also told me that this patch is not correct.
> Thanks for pointing this out. totalram_pages() does not work
> for some cases:

Just a thought: If memory hotplug is applicable as well, you might either want to always assume data->enable_4GB, or handle memory hotplug events from the memory notifier, when new memory gets onlined (not sure how tricky that is).

--
Thanks,

David / dhildenb
Re: [PATCH] iommu/mediatek: Use totalram_pages to setup enable_4GB
On 04.06.20 10:01, Miles Chen wrote:
> To build this driver as a kernel module, we cannot use
> the unexported symbol "max_pfn" to setup enable_4GB.
>
> Use totalram_pages() instead to setup enable_4GB.
>
> Suggested-by: Mike Rapoport
> Signed-off-by: Miles Chen
> Cc: David Hildenbrand
> Cc: Yong Wu
> Cc: Chao Hao
> ---
> drivers/iommu/mtk_iommu.c | 5 ++---
> 1 file changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/iommu/mtk_iommu.c b/drivers/iommu/mtk_iommu.c
> index 5f4d6df59cf6..c2798a6e0e38 100644
> --- a/drivers/iommu/mtk_iommu.c
> +++ b/drivers/iommu/mtk_iommu.c
> @@ -3,7 +3,6 @@
> * Copyright (c) 2015-2016 MediaTek Inc.
> * Author: Yong Wu
> */
> -#include
> #include
> #include
> #include
> @@ -626,8 +625,8 @@ static int mtk_iommu_probe(struct platform_device *pdev)
> 	return -ENOMEM;
> 	data->protect_base = ALIGN(virt_to_phys(protect), MTK_PROTECT_PA_ALIGN);
>
> -	/* Whether the current dram is over 4GB */
> -	data->enable_4GB = !!(max_pfn > (BIT_ULL(32) >> PAGE_SHIFT));
> +	/* Whether the current dram is over 4GB, note: DRAM start at 1GB */
> +	data->enable_4GB = !!(totalram_pages() > ((SZ_2G + SZ_1G) >>
> 				PAGE_SHIFT));

A similar thing seems to be done by drivers/media/platform/mtk-vpu/mtk_vpu.c:
vpu->enable_4GB = !!(totalram_pages() > (SZ_2G >> PAGE_SHIFT));

I do wonder if some weird memory hotplug setups might give you false negatives.

E.g., start a VM with 1GB and hotplug 1GB - it will be hotplugged on x86-64 above 4GB, turning max_pfn into 5GB. totalram_pages() should return something < 2GB.

Same can happen when you have a VM and use ballooning to fake-unplug memory, making totalram_pages() return something < 4GB, but leaving usable pfns >= 4GB.

but ... I don't know if I understood what "enable_4GB" needs/implies ... I don't know if this is applicable to VMs at all (on real HW such memory hotplug setups should not exist) ...
I don't know how this code would react to memory hotplug, i.e. if the condition changes after the driver loaded and enable_4GB would suddenly apply. Again, most probably not relevant on real HW, only for VMs.

--
Thanks,

David / dhildenb
Re: [PATCH V2] mm: Replace all open encodings for NUMA_NO_NODE
On 26.11.18 13:26, Anshuman Khandual wrote:
> At present there are multiple places where an invalid node number is encoded
> as -1. Even though implicitly understood, it is always better to have macros
> for it. Replace these open encodings for an invalid node number with the
> global macro NUMA_NO_NODE. This helps remove NUMA related assumptions like
> 'invalid node' from various places, redirecting them to a common definition.
>
> Signed-off-by: Anshuman Khandual
> ---
> Changes in V2:
>
> - Added inclusion of 'numa.h' header at various places per Andrew
> - Updated 'dev_to_node' to use NUMA_NO_NODE instead per Vinod

Reviewed-by: David Hildenbrand

--
Thanks,

David / dhildenb
Re: [PATCH] mm: Replace all open encodings for NUMA_NO_NODE
On 23.11.18 10:54, Anshuman Khandual wrote:
> At present there are multiple places where an invalid node number is encoded
> as -1. Even though implicitly understood, it is always better to have macros
> for it. Replace these open encodings for an invalid node number with the
> global macro NUMA_NO_NODE. This helps remove NUMA related assumptions like
> 'invalid node' from various places, redirecting them to a common definition.
>
> Signed-off-by: Anshuman Khandual
> ---
>
> Changes in V1:
>
> - Dropped OCFS2 changes per Joseph
> - Dropped media/video drivers changes per Hans
>
> RFC - https://patchwork.kernel.org/patch/10678035/
>
> Build tested this with multiple cross compiler options like alpha, sparc,
> arm64, x86, powerpc, powerpc64le etc with their default configs, which might
> not have compile tested all driver related changes. I will appreciate
> folks giving this a test in their respective build environments.
>
> All these places for replacement were found by running the following grep
> patterns on the entire kernel code. Please let me know if this might have
> missed some instances. This might also have replaced some false positives.
> I will appreciate suggestions, inputs and review.
>
> 1. git grep "nid == -1"
> 2. git grep "node == -1"
> 3. git grep "nid = -1"
> 4. git grep "node = -1"

Hopefully you found most users :) Did you check if some are encoded into function calls? f(-1, ...)

Reviewed-by: David Hildenbrand

--
Thanks,

David / dhildenb