[PATCH v1 2/2] mm: enforce pageblock_order < MAX_ORDER

2022-02-14 Thread David Hildenbrand
Some places in the kernel don't really expect pageblock_order >=
MAX_ORDER, and it looks like this is only possible in corner cases:

1) With CONFIG_DEFERRED_STRUCT_PAGE_INIT, we'll end up freeing
   pageblock_order pages via __free_pages_core(), which cannot possibly work.

2) find_zone_movable_pfns_for_nodes() will round up the ZONE_MOVABLE
   start PFN to MAX_ORDER_NR_PAGES. Consequently with a bigger
   pageblock_order, we could have a single pageblock partially managed by
   two zones.

3) compaction code runs into __fragmentation_index() with order
   >= MAX_ORDER and triggers the WARN_ON_ONCE(order >= MAX_ORDER) check. [1]

4) mm/page_reporting.c won't be reporting any pages with default
   page_reporting_order == pageblock_order, as we'll be skipping the
   reporting loop inside page_reporting_process_zone() (see the sketch
   below this list).

5) __rmqueue_fallback() will never be able to steal with
   ALLOC_NOFRAGMENT.
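
To illustrate 4), a simplified sketch of the reporting loop in
page_reporting_process_zone() (not the verbatim kernel code): with
page_reporting_order == pageblock_order >= MAX_ORDER, the loop body is
never entered, so nothing is ever reported:

	for (order = page_reporting_order; order < MAX_ORDER; order++) {
		for (mt = 0; mt < MIGRATE_TYPES; mt++) {
			/* unreachable when page_reporting_order >= MAX_ORDER */
			err = page_reporting_cycle(prdev, zone, order, mt,
						   sgl, &offset);
			if (err)
				return err;
		}
	}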

pageblock_order >= MAX_ORDER is weird either way: it's a pure
optimization for making alloc_contig_range(), as used for allocation of
gigantic pages, a little more likely to succeed. However, if there is
demand for somewhat reliable allocation of gigantic pages, affected setups
should be using CMA or boottime allocations instead.

So let's make sure that pageblock_order < MAX_ORDER and simplify.

[1] https://lkml.kernel.org/r/87r189a2ks@linux.ibm.com
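
For the HUGETLB_PAGE_SIZE_VARIABLE case, the runtime pageblock_order gets
clamped the same way; a sketch of the idea (the actual mm/page_alloc.c hunk
is truncated below):

	void __init set_pageblock_order(void)
	{
		unsigned int order = MAX_ORDER - 1;

		/* Check that pageblock_order has not already been set up. */
		if (pageblock_order)
			return;

		/* Don't exceed the maximum allocation granularity. */
		if (HPAGE_SHIFT > PAGE_SHIFT && HUGETLB_PAGE_ORDER < order)
			order = HUGETLB_PAGE_ORDER;

		pageblock_order = order;
	}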

Signed-off-by: David Hildenbrand 
---
 drivers/virtio/virtio_mem.c |  9 +++--
 include/linux/cma.h |  3 +--
 include/linux/pageblock-flags.h |  7 +--
 mm/Kconfig  |  3 +++
 mm/page_alloc.c | 32 
 5 files changed, 20 insertions(+), 34 deletions(-)

diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index 38becd8d578c..e7d6b679596d 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -2476,13 +2476,10 @@ static int virtio_mem_init_hotplug(struct virtio_mem *vm)
  VIRTIO_MEM_DEFAULT_OFFLINE_THRESHOLD);
 
/*
-* We want subblocks to span at least MAX_ORDER_NR_PAGES and
-* pageblock_nr_pages pages. This:
-* - Is required for now for alloc_contig_range() to work reliably -
-*   it doesn't properly handle smaller granularity on ZONE_NORMAL.
+* TODO: once alloc_contig_range() works reliably with pageblock
+* granularity on ZONE_NORMAL, use pageblock_nr_pages instead.
 */
-   sb_size = max_t(uint64_t, MAX_ORDER_NR_PAGES,
-   pageblock_nr_pages) * PAGE_SIZE;
+   sb_size = PAGE_SIZE * MAX_ORDER_NR_PAGES;
sb_size = max_t(uint64_t, vm->device_block_size, sb_size);
 
if (sb_size < memory_block_size_bytes() && !force_bbm) {
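
(For scale: assuming a common x86-64 configuration with 4k base pages and
MAX_ORDER == 11, MAX_ORDER_NR_PAGES is 1024, so sb_size starts out at 4 MiB
before factoring in the device block size.)
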
diff --git a/include/linux/cma.h b/include/linux/cma.h
index 75fe188ec4a1..b1ba94f1cc9c 100644
--- a/include/linux/cma.h
+++ b/include/linux/cma.h
@@ -25,8 +25,7 @@
  * -- can deal with only some pageblocks of a higher-order page being
  *  MIGRATE_CMA, we can use pageblock_nr_pages.
  */
-#define CMA_MIN_ALIGNMENT_PAGES max_t(phys_addr_t, MAX_ORDER_NR_PAGES, \
- pageblock_nr_pages)
+#define CMA_MIN_ALIGNMENT_PAGES MAX_ORDER_NR_PAGES
 #define CMA_MIN_ALIGNMENT_BYTES (PAGE_SIZE * CMA_MIN_ALIGNMENT_PAGES)
 
 struct cma;
diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h
index 973fd731a520..83c7248053a1 100644
--- a/include/linux/pageblock-flags.h
+++ b/include/linux/pageblock-flags.h
@@ -37,8 +37,11 @@ extern unsigned int pageblock_order;
 
 #else /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */
 
-/* Huge pages are a constant size */
-#define pageblock_order	HUGETLB_PAGE_ORDER
+/*
+ * Huge pages are a constant size, but don't exceed the maximum allocation
+ * granularity.
+ */
+#define pageblock_order	min_t(unsigned int, HUGETLB_PAGE_ORDER, MAX_ORDER - 1)
 
 #endif /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */
 
diff --git a/mm/Kconfig b/mm/Kconfig
index 3326ee3903f3..4c91b92e7537 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -262,6 +262,9 @@ config HUGETLB_PAGE_SIZE_VARIABLE
  HUGETLB_PAGE_ORDER when there are multiple HugeTLB page sizes available
  on a platform.
 
+ Note that the pageblock_order cannot exceed MAX_ORDER - 1 and will be
+ clamped down to MAX_ORDER - 1.
+
 config CONTIG_ALLOC
def_bool (MEMORY_ISOLATION && COMPACTION) || CMA
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3589febc6d31..04cf964b57b5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1072,14 +1072,12 @@ static inline void __free_one_page(struct page *page,
int migratetype, fpi_t fpi_flags)
 {
struct capture_control *capc = task_capc(zone);
+   unsigned int max_order = pageblock_order;
unsigned long buddy_pfn;
unsigned long combined_pfn;
-   unsigned int max_order;
struct page *buddy;
bool

[PATCH v1 1/2] cma: factor out minimum alignment requirement

2022-02-14 Thread David Hildenbrand
Let's factor out determining the minimum alignment requirement for CMA
and add a helpful comment.

No functional change intended.

Signed-off-by: David Hildenbrand 
---
 arch/powerpc/include/asm/fadump-internal.h |  5 -
 arch/powerpc/kernel/fadump.c   |  2 +-
 drivers/of/of_reserved_mem.c   |  9 +++--
 include/linux/cma.h|  9 +
 kernel/dma/contiguous.c|  4 +---
 mm/cma.c   | 20 +---
 6 files changed, 19 insertions(+), 30 deletions(-)

diff --git a/arch/powerpc/include/asm/fadump-internal.h b/arch/powerpc/include/asm/fadump-internal.h
index 52189928ec08..81bcb9abb371 100644
--- a/arch/powerpc/include/asm/fadump-internal.h
+++ b/arch/powerpc/include/asm/fadump-internal.h
@@ -19,11 +19,6 @@
 
 #define memblock_num_regions(memblock_type)	(memblock.memblock_type.cnt)
 
-/* Alignment per CMA requirement. */
-#define FADUMP_CMA_ALIGNMENT   (PAGE_SIZE <<   \
-max_t(unsigned long, MAX_ORDER - 1,\
-pageblock_order))
-
 /* FAD commands */
 #define FADUMP_REGISTER1
 #define FADUMP_UNREGISTER  2
diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index d03e488cfe9c..7eb67201ea41 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -544,7 +544,7 @@ int __init fadump_reserve_mem(void)
if (!fw_dump.nocma) {
fw_dump.boot_memory_size =
ALIGN(fw_dump.boot_memory_size,
- FADUMP_CMA_ALIGNMENT);
+ CMA_MIN_ALIGNMENT_BYTES);
}
 #endif
 
diff --git a/drivers/of/of_reserved_mem.c b/drivers/of/of_reserved_mem.c
index 9c0fb962c22b..75caa6f5d36f 100644
--- a/drivers/of/of_reserved_mem.c
+++ b/drivers/of/of_reserved_mem.c
@@ -22,6 +22,7 @@
 #include 
 #include 
 #include 
+#include <linux/cma.h>
 
 #include "of_private.h"
 
@@ -116,12 +117,8 @@ static int __init __reserved_mem_alloc_size(unsigned long node,
if (IS_ENABLED(CONFIG_CMA)
&& of_flat_dt_is_compatible(node, "shared-dma-pool")
&& of_get_flat_dt_prop(node, "reusable", NULL)
-   && !nomap) {
-   unsigned long order =
-   max_t(unsigned long, MAX_ORDER - 1, pageblock_order);
-
-   align = max(align, (phys_addr_t)PAGE_SIZE << order);
-   }
+   && !nomap)
+   align = max_t(phys_addr_t, align, CMA_MIN_ALIGNMENT_BYTES);
 
	prop = of_get_flat_dt_prop(node, "alloc-ranges", &len);
if (prop) {
diff --git a/include/linux/cma.h b/include/linux/cma.h
index bd801023504b..75fe188ec4a1 100644
--- a/include/linux/cma.h
+++ b/include/linux/cma.h
@@ -20,6 +20,15 @@
 
 #define CMA_MAX_NAME 64
 
+/*
+ * TODO: once the buddy -- especially pageblock merging and alloc_contig_range()
+ * -- can deal with only some pageblocks of a higher-order page being
+ *  MIGRATE_CMA, we can use pageblock_nr_pages.
+ */
+#define CMA_MIN_ALIGNMENT_PAGES max_t(phys_addr_t, MAX_ORDER_NR_PAGES, \
+ pageblock_nr_pages)
+#define CMA_MIN_ALIGNMENT_BYTES (PAGE_SIZE * CMA_MIN_ALIGNMENT_PAGES)
+
 struct cma;
 
 extern unsigned long totalcma_pages;
diff --git a/kernel/dma/contiguous.c b/kernel/dma/contiguous.c
index 3d63d91cba5c..6ea80ae42622 100644
--- a/kernel/dma/contiguous.c
+++ b/kernel/dma/contiguous.c
@@ -399,8 +399,6 @@ static const struct reserved_mem_ops rmem_cma_ops = {
 
 static int __init rmem_cma_setup(struct reserved_mem *rmem)
 {
-   phys_addr_t align = PAGE_SIZE << max(MAX_ORDER - 1, pageblock_order);
-   phys_addr_t mask = align - 1;
unsigned long node = rmem->fdt_node;
bool default_cma = of_get_flat_dt_prop(node, "linux,cma-default", NULL);
struct cma *cma;
@@ -416,7 +414,7 @@ static int __init rmem_cma_setup(struct reserved_mem *rmem)
of_get_flat_dt_prop(node, "no-map", NULL))
return -EINVAL;
 
-   if ((rmem->base & mask) || (rmem->size & mask)) {
+   if (!IS_ALIGNED(rmem->base | rmem->size, CMA_MIN_ALIGNMENT_BYTES)) {
pr_err("Reserved memory: incorrect alignment of CMA region\n");
return -EINVAL;
}
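
(ORing base and size lets a single IS_ALIGNED() check cover both values:
the OR is aligned exactly when each operand is.)
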
diff --git a/mm/cma.c b/mm/cma.c
index bc9ca8f3c487..5a2cd5851658 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -168,7 +168,6 @@ int __init cma_init_reserved_mem(phys_addr_t base, phys_addr_t size,
 struct cma **res_cma)
 {
struct cma *cma;
-   phys_addr_t alignment;
 
/* Sanity checks */
if (cma_area_count == ARRAY_SIZE(cma_areas)) {
@@ -179,15 +178,12 @@ int __init cma_init_reserved_mem(phys_addr_t base

[PATCH v1 0/2] mm: enforce pageblock_order < MAX_ORDER

2022-02-14 Thread David Hildenbrand
Having pageblock_order >= MAX_ORDER can happen in corner cases and some
parts of the kernel are not prepared for it.

For example, Aneesh has shown [1] that such kernels can be compiled on
ppc64 with 64k base pages by setting FORCE_MAX_ZONEORDER=8, which will run
into a WARN_ON_ONCE(order >= MAX_ORDER) in compaction code right during
boot.

We can get pageblock_order >= MAX_ORDER when the default hugetlb size is
bigger than the maximum allocation granularity of the buddy, in which case
we are no longer talking about huge pages but instead gigantic pages.

Having pageblock_order >= MAX_ORDER can only make alloc_contig_range() of
such gigantic pages more likely to succeed.

Reliable use of gigantic pages either requires boot time allocation or CMA;
there is no need to overcomplicate some places in the kernel to optimize for
corner cases that are broken in other areas of the kernel.

Let's enforce pageblock_order < MAX_ORDER and simplify.

Especially patch #1 can be regarded as a cleanup before:
[PATCH v5 0/6] Use pageblock_order for cma and alloc_contig_range
alignment. [2]

[1] https://lkml.kernel.org/r/87r189a2ks@linux.ibm.com
[2] https://lkml.kernel.org/r/20220211164135.1803616-1-zi@sent.com

Cc: Andrew Morton 
Cc: Aneesh Kumar K V 
Cc: Zi Yan 
Cc: Michael Ellerman 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Rob Herring 
Cc: Frank Rowand 
Cc: "Michael S. Tsirkin" 
Cc: Christoph Hellwig 
Cc: Marek Szyprowski 
Cc: Robin Murphy 
Cc: Minchan Kim 
Cc: Vlastimil Babka 
Cc: linuxppc-...@lists.ozlabs.org
Cc: devicet...@vger.kernel.org
Cc: virtualizat...@lists.linux-foundation.org
Cc: iommu@lists.linux-foundation.org
Cc: linux...@kvack.org

David Hildenbrand (2):
  cma: factor out minimum alignment requirement
  mm: enforce pageblock_order < MAX_ORDER

 arch/powerpc/include/asm/fadump-internal.h |  5 
 arch/powerpc/kernel/fadump.c   |  2 +-
 drivers/of/of_reserved_mem.c   |  9 ++
 drivers/virtio/virtio_mem.c|  9 ++
 include/linux/cma.h|  8 ++
 include/linux/pageblock-flags.h|  7 +++--
 kernel/dma/contiguous.c|  4 +--
 mm/Kconfig |  3 ++
 mm/cma.c   | 20 --
 mm/page_alloc.c| 32 ++
 10 files changed, 37 insertions(+), 62 deletions(-)


base-commit: 754e0b0e35608ed5206d6a67a791563c631cec07
-- 
2.34.1



Re: [PATCH v4 3/7] mm: page_isolation: check specified range for unmovable pages

2022-02-02 Thread David Hildenbrand
On 02.02.22 13:18, Oscar Salvador wrote:
> On Wed, Jan 19, 2022 at 02:06:19PM -0500, Zi Yan wrote:
>> From: Zi Yan 
>>
>> Enable set_migratetype_isolate() to check specified sub-range for
>> unmovable pages during isolation. Page isolation is done
>> at max(MAX_ORDER_NR_PAGES, pageblock_nr_pages) granularity, but not all
>> pages within that granularity are intended to be isolated. For example,
>> alloc_contig_range(), which uses page isolation, allows ranges without
>> alignment. This commit makes unmovable page check only look for
>> interesting pages, so that page isolation can succeed for any
>> non-overlapping ranges.
> 
> Another thing that came to my mind.
> Prior to this patch, has_unmovable_pages() was checking on pageblock
> granularity, starting at pfn#0 of the pageblock.
> With this patch, you no longer check on pageblock granularity, which
> means you might isolate a pageblock, but some pages that sneaked in
> might actually be unmovable.
> 
> E.g:
> 
> Let's say you have a pageblock that spans (pfn#512,pfn#1024),
> and you pass alloc_contig_range() (pfn#514,pfn#1024).
> has_unmovable_pages() will start checking the pageblock at pfn#514,
> and so it will miss pfn#512 and pfn#513. Isn't that a problem, if those
> pfn turn out to be actually unmovable?

That's the whole idea for being able to allocate parts of an unmovable
pageblock that are movable.

If the first part is unmovable but the second part is movable, nothing
should stop us from trying to allocate the second part.

Of course, we want to remember the original migratetype of the
pageblock, to restore that after isolation -- otherwise we'll end up
converting all such pageblocks to MIGRATE_MOVABLE. The next patch does
that IIRC.

However, devil is in the detail, and I still have to review those parts
of this series.


Note that there are no current users of alloc_contig_range() that
allocate < MAX_ORDER - 1 -- except CMA, but for CMA we immediately exit
has_unmovable_pages() either way.
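
(For reference, a simplified sketch of that early exit in
has_unmovable_pages() -- not the verbatim kernel code:)

	if (is_migrate_cma_page(page)) {
		/*
		 * CMA isolates with migratetype == MIGRATE_CMA; for anybody
		 * else, a CMA pageblock counts as unmovable right away.
		 */
		if (is_migrate_cma(migratetype))
			return NULL;
		return page;
	}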

-- 
Thanks,

David / dhildenb



Re: [RFC PATCH v3 7/8] drivers: virtio_mem: use pageblock size as the minimum virtio_mem size.

2022-01-14 Thread David Hildenbrand
On 05.01.22 22:47, Zi Yan wrote:
> From: Zi Yan 
> 
> alloc_contig_range() now only needs to be aligned to pageblock_order,
> drop virtio_mem size requirement that it needs to be the max of
> pageblock_order and MAX_ORDER.
> 
> Signed-off-by: Zi Yan 
> ---
>  drivers/virtio/virtio_mem.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
> index a6a78685cfbe..2664dc16d0f9 100644
> --- a/drivers/virtio/virtio_mem.c
> +++ b/drivers/virtio/virtio_mem.c
> @@ -2481,8 +2481,7 @@ static int virtio_mem_init_hotplug(struct virtio_mem *vm)
>* - Is required for now for alloc_contig_range() to work reliably -
>*   it doesn't properly handle smaller granularity on ZONE_NORMAL.
>*/

Please also update this comment.

-- 
Thanks,

David / dhildenb



Re: [RFC PATCH v3 5/8] mm: page_isolation: check specified range for unmovable pages during isolation.

2022-01-14 Thread David Hildenbrand
On 05.01.22 22:47, Zi Yan wrote:
> From: Zi Yan 
> 
> Enable set_migratetype_isolate() to check specified sub-range for
> unmovable pages during isolation. Page isolation is done
> at max(MAX_ORDER_NR_PAGES, pageblock_nr_pages) granularity, but not all
> pages within that granularity are intended to be isolated. For example,
> alloc_contig_range(), which uses page isolation, allows ranges without
> alignment. This commit makes unmovable page check only look for
> interesting pages, so that page isolation can succeed for any
> non-overlapping ranges.

Are you handling the case where we start checking in the middle of a compound
page and actually have to look up the head to figure out if it is movable or not?

> 
> has_unmovable_pages() is moved to mm/page_isolation.c since it is only
> used by page isolation.

Please move that into a separate patch upfront, makes this patch much
easier to review.

-- 
Thanks,

David / dhildenb



Re: [RFC PATCH v3 3/8] mm: migrate: allocate the right size of non hugetlb or THP compound pages.

2022-01-13 Thread David Hildenbrand
On 13.01.22 16:46, Zi Yan wrote:
> On 12 Jan 2022, at 6:04, David Hildenbrand wrote:
> 
>> On 05.01.22 22:47, Zi Yan wrote:
>>> From: Zi Yan 
>>>
>>> alloc_migration_target() is used by alloc_contig_range() and non-LRU
>>> movable compound pages can be migrated. Current code does not allocate the
>>> right page size for such pages. Check THP precisely using
>>> is_transparent_huge() and add allocation support for non-LRU compound
>>> pages.
>>
>> IIRC, we don't have any non-lru migratable pages that are compound
>> pages. Read: not used and not supported :)
> 
> OK, but nothing prevents one writing a driver that allocates compound
> pages and provides address_space->migratepage() and 
> address_space->isolate_page().
> 
> Actually, to test this series, I write a kernel module that allocates
> an order-10 page, gives it a fake address_space with migratepage() and
> isolate_page(), __SetPageMovable() on it, then call alloc_contig_range()
> on the page range. Apparently, my kernel module is not supported by
> the kernel, thus, I added this patch.
> 
> Do you have an alternative test to my kernel module, so that I do not
> even need this patch myself?
> 
>> Why is this required in the context of this series?
> 
> It might not be required. I will drop it.

That's why I think it would be best dropping it. If you need it in
different context, better submit it in different context.

Makes this series easier to digest :)
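
For reference, the skeleton of such a test module would look roughly like
this (hypothetical code reconstructed from Zi Yan's description above; the
fake mapping, the migration hooks and most error handling are elided):

	static const struct address_space_operations test_aops = {
		.migratepage	= test_migratepage,	/* hypothetical hooks */
		.isolate_page	= test_isolate_page,
		.putback_page	= test_putback_page,
	};

	static int __init test_init(void)
	{
		struct page *page = alloc_pages(GFP_KERNEL | __GFP_COMP, 10);
		unsigned long pfn;

		if (!page)
			return -ENOMEM;
		pfn = page_to_pfn(page);

		/* Pretend the page is a driver-owned movable allocation. */
		page->mapping = &test_mapping;	/* test_mapping.a_ops == &test_aops */
		__SetPageMovable(page, page->mapping);

		/* Exercise alloc_contig_range() on the page's PFN range. */
		return alloc_contig_range(pfn, pfn + (1UL << 10),
					  MIGRATE_MOVABLE, GFP_KERNEL);
	}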


-- 
Thanks,

David / dhildenb



Re: [RFC PATCH v3 1/8] mm: page_alloc: avoid merging non-fallbackable pageblocks with others.

2022-01-13 Thread David Hildenbrand
On 13.01.22 12:36, Mike Rapoport wrote:
> On Wed, Jan 12, 2022 at 11:54:49AM +0100, David Hildenbrand wrote:
>> On 05.01.22 22:47, Zi Yan wrote:
>>> From: Zi Yan 
>>>
>>> This is done in addition to MIGRATE_ISOLATE pageblock merge avoidance.
>>> It prepares for the upcoming removal of the MAX_ORDER-1 alignment
>>> requirement for CMA and alloc_contig_range().
>>>
>>> MIGRATE_HIGHATOMIC should not merge with other migratetypes like
>>> MIGRATE_ISOLATE and MIGRATE_CMA[1], so this commit prevents that too.
>>> Also add MIGRATE_HIGHATOMIC to fallbacks array for completeness.
>>>
>>> [1] 
>>> https://lore.kernel.org/linux-mm/20211130100853.gp3...@techsingularity.net/
>>>
>>> Signed-off-by: Zi Yan 
>>> ---
>>>  include/linux/mmzone.h |  6 ++
>>>  mm/page_alloc.c| 28 ++--
>>>  2 files changed, 24 insertions(+), 10 deletions(-)
>>>
> 
> ...
> 
>>> @@ -3545,8 +3553,8 @@ int __isolate_free_page(struct page *page, unsigned int order)
>>> struct page *endpage = page + (1 << order) - 1;
>>> for (; page < endpage; page += pageblock_nr_pages) {
>>> int mt = get_pageblock_migratetype(page);
>>> -   if (!is_migrate_isolate(mt) && !is_migrate_cma(mt)
>>> -   && !is_migrate_highatomic(mt))
>>> +   /* Only change normal pageblock */
>>> +   if (migratetype_has_fallback(mt))
>>> set_pageblock_migratetype(page,
>>>   MIGRATE_MOVABLE);
>>> }
>>
>> That part is a nice cleanup IMHO. Although the "has fallback" part is a
>> bit imprecise. "migratetype_is_mergable()" might be a bit clearer;
>> ideally "migratetype_is_mergable_with_other_types()". Can we come up
>> with a nice name for that?
> 
> migratetype_is_mergable() kinda implies "_with_other_types", no?
> 
> I like migratetype_is_mergable() more than _has_fallback().
> 
> My $0.02 to bikeshedding :)

:)

Yeah, for me migratetype_is_mergable() would also be good enough. I
think I was at first thinking one could mistake it with a dedicated
migratetype. But such functions are historically called

is_migrate_cma()/is_migrate_isolate().

-- 
Thanks,

David / dhildenb



Re: [RFC PATCH v3 3/8] mm: migrate: allocate the right size of non hugetlb or THP compound pages.

2022-01-12 Thread David Hildenbrand
On 05.01.22 22:47, Zi Yan wrote:
> From: Zi Yan 
> 
> alloc_migration_target() is used by alloc_contig_range() and non-LRU
> movable compound pages can be migrated. Current code does not allocate the
> right page size for such pages. Check THP precisely using
> is_transparent_huge() and add allocation support for non-LRU compound
> pages.

IIRC, we don't have any non-lru migratable pages that are compound
pages. Read: not used and not supported :)

Why is this required in the context of this series?


-- 
Thanks,

David / dhildenb



Re: [RFC PATCH v3 2/8] mm: compaction: handle non-lru compound pages properly in isolate_migratepages_block().

2022-01-12 Thread David Hildenbrand
On 05.01.22 22:47, Zi Yan wrote:
> From: Zi Yan 
> 
> In isolate_migratepages_block(), a !PageLRU tail page can be encountered
> when the page is larger than a pageblock. Use compound head page for the
> checks inside and skip the entire compound page when isolation succeeds.
> 

This will currently never happen, due to the way we always isolate
MAX_ORDER - 1 ranges, correct?

Better note that in the patch description, because currently it reads
like it's an actual fix "can be encountered".

> Signed-off-by: Zi Yan 
> ---
>  mm/compaction.c | 10 +++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index b4e94cda3019..ad9053fbbe06 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -979,19 +979,23 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>* Skip any other type of page
>*/
>   if (!PageLRU(page)) {
> + struct page *head = compound_head(page);
>   /*
>* __PageMovable can return false positive so we need
>* to verify it under page_lock.
>*/
> - if (unlikely(__PageMovable(page)) &&
> - !PageIsolated(page)) {
> + if (unlikely(__PageMovable(head)) &&
> + !PageIsolated(head)) {
>   if (locked) {
>   unlock_page_lruvec_irqrestore(locked, flags);
>   locked = NULL;
>   }
>  
> - if (!isolate_movable_page(page, isolate_mode))
> + if (!isolate_movable_page(head, isolate_mode)) {
> + low_pfn += (1 << compound_order(head)) - 1 - (page - head);
> + page = head;
>   goto isolate_success;
> + }
>   }
>  
>   goto isolate_fail;


-- 
Thanks,

David / dhildenb



Re: [RFC PATCH v3 1/8] mm: page_alloc: avoid merging non-fallbackable pageblocks with others.

2022-01-12 Thread David Hildenbrand
On 05.01.22 22:47, Zi Yan wrote:
> From: Zi Yan 
> 
> This is done in addition to MIGRATE_ISOLATE pageblock merge avoidance.
> It prepares for the upcoming removal of the MAX_ORDER-1 alignment
> requirement for CMA and alloc_contig_range().
> 
> MIGRATE_HIGHATOMIC should not merge with other migratetypes like
> MIGRATE_ISOLATE and MIGRATE_CMA[1], so this commit prevents that too.
> Also add MIGRATE_HIGHATOMIC to fallbacks array for completeness.
> 
> [1] 
> https://lore.kernel.org/linux-mm/20211130100853.gp3...@techsingularity.net/
> 
> Signed-off-by: Zi Yan 
> ---
>  include/linux/mmzone.h |  6 ++
>  mm/page_alloc.c| 28 ++--
>  2 files changed, 24 insertions(+), 10 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index aed44e9b5d89..0aa549653e4e 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -83,6 +83,12 @@ static inline bool is_migrate_movable(int mt)
>   return is_migrate_cma(mt) || mt == MIGRATE_MOVABLE;
>  }
>  
> +/* See fallbacks[MIGRATE_TYPES][3] in page_alloc.c */
> +static inline bool migratetype_has_fallback(int mt)
> +{
> + return mt < MIGRATE_PCPTYPES;
> +}
> +
>  #define for_each_migratetype_order(order, type) \
>   for (order = 0; order < MAX_ORDER; order++) \
>   for (type = 0; type < MIGRATE_TYPES; type++)
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 8dd6399bafb5..5193c953dbf8 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1042,6 +1042,12 @@ buddy_merge_likely(unsigned long pfn, unsigned long buddy_pfn,
>   return page_is_buddy(higher_page, higher_buddy, order + 1);
>  }
>  
> +static inline bool has_non_fallback_pageblock(struct zone *zone)
> +{
> + return has_isolate_pageblock(zone) || zone_cma_pages(zone) != 0 ||
> + zone->nr_reserved_highatomic != 0;
> +}

Due to zone_cma_pages(), the unlikely() below will be very wrong on many
setups. Previously, isolation really was a corner case. CMA and
highatomic are less of a corner case ...

I'm not even sure if this check is worth having around anymore at all,
or if it would be easier and cheaper to just always check both
migration types unconditionally. Would certainly simplify the code.
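
(Illustrative only -- what that unconditional variant could look like,
reusing migratetype_has_fallback() from this patch:)

	/* order >= pageblock_order: always compare the two migratetypes. */
	buddy_pfn = __find_buddy_pfn(pfn, order);
	buddy = page + (buddy_pfn - pfn);
	buddy_mt = get_pageblock_migratetype(buddy);

	if (migratetype != buddy_mt &&
	    (!migratetype_has_fallback(migratetype) ||
	     !migratetype_has_fallback(buddy_mt)))
		goto done_merging;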

Side note: we actually care about has_free_non_fallback_pageblock(), we
can only merge with free pageblocks. But that might not necessarily be
cheaper to test/track/check.

> +
>  /*
>   * Freeing function for a buddy system allocator.
>   *
> @@ -1117,14 +1123,15 @@ static inline void __free_one_page(struct page *page,
>   }
>   if (order < MAX_ORDER - 1) {
>   /* If we are here, it means order is >= pageblock_order.
> -  * We want to prevent merge between freepages on isolate
> -  * pageblock and normal pageblock. Without this, pageblock
> -  * isolation could cause incorrect freepage or CMA accounting.
> +  * We want to prevent merge between freepages on pageblock
> +  * without fallbacks and normal pageblock. Without this,
> +  * pageblock isolation could cause incorrect freepage or CMA
> +  * accounting or HIGHATOMIC accounting.
>*
>* We don't want to hit this code for the more frequent
>* low-order merging.
>*/
> - if (unlikely(has_isolate_pageblock(zone))) {
> + if (unlikely(has_non_fallback_pageblock(zone))) {
>   int buddy_mt;
>  
>   buddy_pfn = __find_buddy_pfn(pfn, order);
> @@ -1132,8 +1139,8 @@ static inline void __free_one_page(struct page *page,
>   buddy_mt = get_pageblock_migratetype(buddy);
>  
>   if (migratetype != buddy_mt
> - && (is_migrate_isolate(migratetype) ||
> - is_migrate_isolate(buddy_mt)))
> +				&& (!migratetype_has_fallback(migratetype) ||
> +				    !migratetype_has_fallback(buddy_mt)))
>   goto done_merging;
>   }
>   max_order = order + 1;
> @@ -2484,6 +2491,7 @@ static int fallbacks[MIGRATE_TYPES][3] = {
> [MIGRATE_UNMOVABLE]   = { MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE,   MIGRATE_TYPES },
> [MIGRATE_MOVABLE] = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE, MIGRATE_TYPES },
> [MIGRATE_RECLAIMABLE] = { MIGRATE_UNMOVABLE,   MIGRATE_MOVABLE,   MIGRATE_TYPES },
> + [MIGRATE_HIGHATOMIC] = { MIGRATE_TYPES }, /* Never used */
>  #ifdef CONFIG_CMA
>   [MIGRATE_CMA] = { MIGRATE_TYPES }, /* Never used */
>  #endif
> @@ -2795,8 +2803,8 @@ static void reserve_highatomic_pageblock(struct page *page, struct zone *zone,
>  
>   /* Yoink! */
>   mt = get_pageblock_migratetype(page);
> - if (!is_migrate_highatomic(mt) && 

Re: [RFC PATCH v2 0/7] Use pageblock_order for cma and alloc_contig_range alignment.

2021-12-10 Thread David Hildenbrand
On 10.12.21 00:04, Zi Yan wrote:
> From: Zi Yan 
> 
> Hi all,

Hi,

thanks for working on that!

> 
> This patchset tries to remove the MAX_ORDER - 1 alignment requirement for CMA
> and alloc_contig_range(). It prepares for my upcoming changes to make
> MAX_ORDER adjustable at boot time[1].
> 
> The MAX_ORDER - 1 alignment requirement comes from that alloc_contig_range()
> isolates pageblocks to remove free memory from buddy allocator but isolating
> only a subset of pageblocks within a page spanning across multiple pageblocks
> causes free page accounting issues. Isolated page might not be put into the
> right free list, since the code assumes the migratetype of the first pageblock
> as the whole free page migratetype. This is based on the discussion at [2].
> 
> To remove the requirement, this patchset:
> 1. still isolates pageblocks at MAX_ORDER - 1 granularity;
> 2. but saves the pageblock migratetypes outside the specified range of
>alloc_contig_range() and restores them after all pages within the range
>become free after __alloc_contig_migrate_range();
> 3. splits free pages spanning multiple pageblocks at the beginning and the end
>of the range and puts the split pages to the right migratetype free lists
>based on the pageblock migratetypes;
> 4. returns pages not in the range as it did before this patch.
> 
> Isolation needs to happen at MAX_ORDER - 1 granularity, because otherwise
> 1) extra code is needed to detect pages (free, PageHuge, THP, or PageCompound)
> to make sure all pageblocks belonging to a single page are isolated together 
> and later pageblocks outside the range need to have their migratetypes 
> restored;
> or 2) extra logic will need to be added during page free time to split a free
> page with multi-migratetype pageblocks.
> 
> Two optimizations might come later:
> 1. only check unmovable pages within the range instead of MAX_ORDER - 1
>    aligned range during isolation to increase successful rate of
>    alloc_contig_range().

The issue with virtio-mem is that we'll need that as soon as we change
the granularity to pageblocks, because otherwise, you can heavily
degrade unplug reliably in sane setups:

Previous:
* Try unplug free 4M range (2 pageblocks): succeeds

Now:
* Try unplug 2M range (first pageblock): succeeds.
* Try unplug next 2M range (second pageblock): fails because first
contains unmovable allocations.

-- 
Thanks,

David / dhildenb



Re: [RFC PATCH 0/3] Use pageblock_order for cma and alloc_contig_range alignment.

2021-11-23 Thread David Hildenbrand
On 17.11.21 04:04, Zi Yan wrote:
> On 16 Nov 2021, at 3:58, David Hildenbrand wrote:
> 
>> On 15.11.21 20:37, Zi Yan wrote:
>>> From: Zi Yan 
>>>
>>> Hi David,
>>
>> Hi,
>>
>> thanks for looking into this.
>>

Hi,

sorry for the delay, I wasn't "actually working" last week, so now I'm
back from holiday :)

>>>
>>> You suggested to make alloc_contig_range() deal with pageblock_order 
>>> instead of
>>> MAX_ORDER - 1 and get rid of MAX_ORDER - 1 dependency in virtio_mem[1]. This
>>> patchset is my attempt to achieve that. Please take a look and let me know 
>>> if
>>> I am doing it correctly or not.
>>>
>>> From what my understanding, cma required alignment of
>>> max(MAX_ORDER - 1, pageblock_order), because when MIGRATE_CMA was 
>>> introduced,
>>> __free_one_page() does not prevent merging two different pageblocks, when
>>> MAX_ORDER - 1 > pageblock_order. But current __free_one_page() 
>>> implementation
>>> does prevent that. It should be OK to just align cma to pageblock_order.
>>> alloc_contig_range() relies on MIGRATE_CMA to get free pages, so it can use
>>> pageblock_order as alignment too.
>>
>> I wonder if that's sufficient. Especially the outer_start logic in
>> alloc_contig_range() might be problematic. There are some ugly corner
>> cases with free pages/allocations spanning multiple pageblocks and we
>> only isolated a single pageblock.
> 
> Thank you a lot for writing the list of these corner cases. They are
> very helpful!
> 
>>
>>
>> Regarding CMA, we have to keep the following cases working:
>>
>> a) Different pageblock types (MIGRATE_CMA and !MIGRATE_CMA) in MAX_ORDER
>>- 1 page:
>>    [   MAX_ORDER - 1 ]
>>[ pageblock 0 | pageblock 1]
>>
>> Assume either pageblock 0 is MIGRATE_CMA or pageblock 1 is MIGRATE_CMA,
>> but not both. We have to make sure alloc_contig_range() keeps working
>> correctly. This should be the case even with your change, as we won't
>> merge pages across differing migratetypes.
> 
> Yes.
> 
>>
>> b) Migrating/freeing a MAX_ORDER - 1 page while partially isolated:
>>    [   MAX_ORDER - 1 ]
>>[ pageblock 0 | pageblock 1]
>>
>> Assume both are MIGRATE_CMA. Assume we want to either allocate from
>> pageblock 0 or pageblock 1. Especially, assume we want to allocate from
>> pageblock 1. While we would isolate pageblock 1, we wouldn't isolate
>> pageblock 0.
>>
>> What happens if we either have a free page spanning the MAX_ORDER - 1
>> range already OR if we have to migrate a MAX_ORDER - 1 page, resulting
>> in a free MAX_ORDER - 1 page of which only the second pageblock is
>> isolated? We would end up essentially freeing a page that has mixed
>> pageblocks, essentially placing it in !MIGRATE_ISOLATE free lists ... I
>> might be wrong but I have the feeling that this would be problematic.
>>
> 
> This could happen when start_isolate_page_range() stumbles upon a compound
> page with order >= pageblock_order or a free page with order >=
> pageblock_order, but should not. start_isolate_page_range() should check
> the actual page size, either compound page size or free page size, and set
> the migratetype across pageblocks if the page is bigger than pageblock size.
> More precisely set_migratetype_isolate() should do that.

Right -- but then we have to extend the isolation range from within
set_migratetype_isolate() :/ E.g., how should we know what we have to
unisolate later ..

> 
> 
>> c) Concurrent allocations:
>> [   MAX_ORDER - 1 ]
>> [ pageblock 0 | pageblock 1]
>>
>> Assume b) but we have two concurrent CMA allocations to pageblock 0 and
>> pageblock 1, which would now be possible as start_isolate_page_range()
>> isolate would succeed on both.
> 
> Two isolations will be serialized by the zone lock taken by
> set_migratetype_isolate(), so the concurrent allocation would not be a 
> problem.
> If it is a MAX_ORDER-1 free page, the first comer should split it and only
> isolate one of the pageblock then second one can isolate the other pageblock.

Right, the issue is essentially b) above.

> If it is a MAX_ORDER-1 compound page, the first comer should isolate both
> pageblocks, then the second one would fail. WDYT?

Possibly we could even have two independent CMA areas "colliding" within
a MAX_ORDER - 1 page. I guess the surprise would be getting an
"-EAGAIN" from isolation, but the caller should properly handle that.

Maybe it's really easier to 

Re: [RFC PATCH 0/3] Use pageblock_order for cma and alloc_contig_range alignment.

2021-11-16 Thread David Hildenbrand
On 15.11.21 20:37, Zi Yan wrote:
> From: Zi Yan 
> 
> Hi David,

Hi,

thanks for looking into this.

> 
> You suggested to make alloc_contig_range() deal with pageblock_order instead
> of MAX_ORDER - 1 and get rid of MAX_ORDER - 1 dependency in virtio_mem[1]. This
> patchset is my attempt to achieve that. Please take a look and let me know if
> I am doing it correctly or not.
> 
> From my understanding, cma required alignment of
> max(MAX_ORDER - 1, pageblock_order), because when MIGRATE_CMA was introduced,
> __free_one_page() does not prevent merging two different pageblocks, when
> MAX_ORDER - 1 > pageblock_order. But current __free_one_page() implementation
> does prevent that. It should be OK to just align cma to pageblock_order.
> alloc_contig_range() relies on MIGRATE_CMA to get free pages, so it can use
> pageblock_order as alignment too.

I wonder if that's sufficient. Especially the outer_start logic in
alloc_contig_range() might be problematic. There are some ugly corner
cases with free pages/allocations spanning multiple pageblocks and we
only isolated a single pageblock.


Regarding CMA, we have to keep the following cases working:

a) Different pageblock types (MIGRATE_CMA and !MIGRATE_CMA) in MAX_ORDER
   - 1 page:
   [   MAX_ORDER - 1 ]
   [ pageblock 0 | pageblock 1]

Assume either pageblock 0 is MIGRATE_CMA or pageblock 1 is MIGRATE_CMA,
but not both. We have to make sure alloc_contig_range() keeps working
correctly. This should be the case even with your change, as we won't
merge pages across differing migratetypes.

b) Migrating/freeing a MAX_ORDER - 1 page while partially isolated:
   [   MAX_ORDER - 1 ]
   [ pageblock 0 | pageblock 1]

Assume both are MIGRATE_CMA. Assume we want to either allocate from
pageblock 0 or pageblock 1. Especially, assume we want to allocate from
pageblock 1. While we would isolate pageblock 1, we wouldn't isolate
pageblock 0.

What happens if we either have a free page spanning the MAX_ORDER - 1
range already OR if we have to migrate a MAX_ORDER - 1 page, resulting
in a free MAX_ORDER - 1 page of which only the second pageblock is
isolated? We would end up essentially freeing a page that has mixed
pageblocks, essentially placing it in !MIGRATE_ISOLATE free lists ... I
might be wrong but I have the feeling that this would be problematic.

c) Concurrent allocations:
[   MAX_ORDER - 1 ]
[ pageblock 0 | pageblock 1]

Assume b) but we have two concurrent CMA allocations to pageblock 0 and
pageblock 1, which would now be possible as start_isolate_page_range()
isolate would succeed on both.


Regarding virtio-mem, we care about the following cases:

a) Allocating parts from completely movable MAX_ORDER - 1 page:
   [   MAX_ORDER - 1 ]
   [ pageblock 0 | pageblock 1]

Assume pageblock 0 and pageblock 1 are either free or contain only
movable pages. Assume we allocated pageblock 0. We have to make sure we
can allocate pageblock 1. The other way around, assume we allocated
pageblock 1, we have to make sure we can allocate pageblock 0.

Free pages spanning both pageblocks might be problematic.

b) Allocate parts of partially movable MAX_ORDER - 1 page:
   [   MAX_ORDER - 1 ]
   [ pageblock 0 | pageblock 1]

Assume pageblock 0 contains unmovable data but pageblock 1 does not: we have
to make sure we can allocate pageblock 1. Similarly, assume pageblock 1
contains unmovable data but pageblock 0 does not: we have to make sure we can
allocate pageblock 0. has_unmovable_pages() might allow for that.

But, we want to fail early in case we want to allocate a single
pageblock but it contains unmovable data. This could be either directly
or indirectly.

If we have an unmovable (compound) MAX_ORDER - 1 page and we'd try isolating
pageblock 1, has_unmovable_pages() would always return "false" because
we'd simply be skipping over any tail pages, and not detect the
un-movability.
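
A sketch of that concern (not the exact kernel code): a scan that starts at
a tail page just skips ahead, without classifying the whole allocation via
its head:

	if (PageCompound(page)) {
		struct page *head = compound_head(page);

		/* Skip the remainder of the compound page. */
		pfn += compound_nr(head) - (page - head) - 1;
		continue;
	}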

c) Migrating/freeing a MAX_ORDER - 1 page while partially isolated:

Same concern as for CMA b)


So the biggest concern I have is dealing with migrating/freeing >
pageblock_order pages while only having isolated a single pageblock.

> 
> In terms of virtio_mem, if I understand correctly, it relies on
> alloc_contig_range() to obtain contiguous free pages and offlines them to reduce
> guest memory size. As the result of alloc_contig_range() alignment change,
> virtio_mem should be able to just align PFNs to pageblock_order.

For virtio-mem it will most probably be desirable to first try
allocating the MAX_ORDER - 1 range if possible and then fall back to
pageblock_order. But that's an additional change on top in virtio-mem code.
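
A hypothetical sketch of that fallback (not actual virtio-mem code):

	/* Prefer MAX_ORDER - 1 granularity, retry with single pageblocks. */
	rc = alloc_contig_range(pfn, pfn + MAX_ORDER_NR_PAGES,
				MIGRATE_MOVABLE, GFP_KERNEL);
	if (rc) {
		unsigned long cur;

		for (cur = pfn; !rc && cur < pfn + MAX_ORDER_NR_PAGES;
		     cur += pageblock_nr_pages)
			rc = alloc_contig_range(cur, cur + pageblock_nr_pages,
						MIGRATE_MOVABLE, GFP_KERNEL);
	}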



My take to teach alloc_contig_range() to properly handle this would be the
following:

a) Convert MIGRATE_ISOLATE into a separate pageblock flag

We would want to convert MIGRATE_ISOLATE into a separate pageblock
flag, such that when we isolate a pageblock we preserve the original
migratetype. While start_isolate_page_range() 

Re: [PATCH RFC] virtio: wrap config->reset calls

2021-10-13 Thread David Hildenbrand
On 13.10.21 14:17, Michael S. Tsirkin wrote:
> On Wed, Oct 13, 2021 at 01:03:46PM +0200, David Hildenbrand wrote:
>> On 13.10.21 12:55, Michael S. Tsirkin wrote:
>>> This will enable cleanups down the road.
>>> The idea is to disable cbs, then add "flush_queued_cbs" callback
>>> as a parameter, this way drivers can flush any work
>>> queued after callbacks have been disabled.
>>>
>>> Signed-off-by: Michael S. Tsirkin 
>>> ---
>>>   arch/um/drivers/virt-pci.c | 2 +-
>>>   drivers/block/virtio_blk.c | 4 ++--
>>>   drivers/bluetooth/virtio_bt.c  | 2 +-
>>>   drivers/char/hw_random/virtio-rng.c| 2 +-
>>>   drivers/char/virtio_console.c  | 4 ++--
>>>   drivers/crypto/virtio/virtio_crypto_core.c | 8 
>>>   drivers/firmware/arm_scmi/virtio.c | 2 +-
>>>   drivers/gpio/gpio-virtio.c | 2 +-
>>>   drivers/gpu/drm/virtio/virtgpu_kms.c   | 2 +-
>>>   drivers/i2c/busses/i2c-virtio.c| 2 +-
>>>   drivers/iommu/virtio-iommu.c   | 2 +-
>>>   drivers/net/caif/caif_virtio.c | 2 +-
>>>   drivers/net/virtio_net.c   | 4 ++--
>>>   drivers/net/wireless/mac80211_hwsim.c  | 2 +-
>>>   drivers/nvdimm/virtio_pmem.c   | 2 +-
>>>   drivers/rpmsg/virtio_rpmsg_bus.c   | 2 +-
>>>   drivers/scsi/virtio_scsi.c | 2 +-
>>>   drivers/virtio/virtio.c| 5 +
>>>   drivers/virtio/virtio_balloon.c| 2 +-
>>>   drivers/virtio/virtio_input.c  | 2 +-
>>>   drivers/virtio/virtio_mem.c| 2 +-
>>>   fs/fuse/virtio_fs.c| 4 ++--
>>>   include/linux/virtio.h | 1 +
>>>   net/9p/trans_virtio.c  | 2 +-
>>>   net/vmw_vsock/virtio_transport.c   | 4 ++--
>>>   sound/virtio/virtio_card.c | 4 ++--
>>>   26 files changed, 39 insertions(+), 33 deletions(-)
>>>
>>> diff --git a/arch/um/drivers/virt-pci.c b/arch/um/drivers/virt-pci.c
>>> index c08066633023..22c4d87c9c15 100644
>>> --- a/arch/um/drivers/virt-pci.c
>>> +++ b/arch/um/drivers/virt-pci.c
>>> @@ -616,7 +616,7 @@ static void um_pci_virtio_remove(struct virtio_device *vdev)
>>> int i;
>>>   /* Stop all virtqueues */
>>> -vdev->config->reset(vdev);
>>> +virtio_reset_device(vdev);
>>>   vdev->config->del_vqs(vdev);
>>
>> Nit: virtio_device_reset()?
>>
>> Because I see:
>>
>> int virtio_device_freeze(struct virtio_device *dev);
>> int virtio_device_restore(struct virtio_device *dev);
>> void virtio_device_ready(struct virtio_device *dev)
>>
>> But well, there is:
>> void virtio_break_device(struct virtio_device *dev);
> 
> Exactly. I don't know what's best, so I opted for plain english :)

Fair enough, LGTM

Reviewed-by: David Hildenbrand 


-- 
Thanks,

David / dhildenb



Re: [PATCH RFC] virtio: wrap config->reset calls

2021-10-13 Thread David Hildenbrand

On 13.10.21 12:55, Michael S. Tsirkin wrote:

This will enable cleanups down the road.
The idea is to disable cbs, then add "flush_queued_cbs" callback
as a parameter, this way drivers can flush any work
queued after callbacks have been disabled.

Signed-off-by: Michael S. Tsirkin 
---
  arch/um/drivers/virt-pci.c | 2 +-
  drivers/block/virtio_blk.c | 4 ++--
  drivers/bluetooth/virtio_bt.c  | 2 +-
  drivers/char/hw_random/virtio-rng.c| 2 +-
  drivers/char/virtio_console.c  | 4 ++--
  drivers/crypto/virtio/virtio_crypto_core.c | 8 
  drivers/firmware/arm_scmi/virtio.c | 2 +-
  drivers/gpio/gpio-virtio.c | 2 +-
  drivers/gpu/drm/virtio/virtgpu_kms.c   | 2 +-
  drivers/i2c/busses/i2c-virtio.c| 2 +-
  drivers/iommu/virtio-iommu.c   | 2 +-
  drivers/net/caif/caif_virtio.c | 2 +-
  drivers/net/virtio_net.c   | 4 ++--
  drivers/net/wireless/mac80211_hwsim.c  | 2 +-
  drivers/nvdimm/virtio_pmem.c   | 2 +-
  drivers/rpmsg/virtio_rpmsg_bus.c   | 2 +-
  drivers/scsi/virtio_scsi.c | 2 +-
  drivers/virtio/virtio.c| 5 +
  drivers/virtio/virtio_balloon.c| 2 +-
  drivers/virtio/virtio_input.c  | 2 +-
  drivers/virtio/virtio_mem.c| 2 +-
  fs/fuse/virtio_fs.c| 4 ++--
  include/linux/virtio.h | 1 +
  net/9p/trans_virtio.c  | 2 +-
  net/vmw_vsock/virtio_transport.c   | 4 ++--
  sound/virtio/virtio_card.c | 4 ++--
  26 files changed, 39 insertions(+), 33 deletions(-)

diff --git a/arch/um/drivers/virt-pci.c b/arch/um/drivers/virt-pci.c
index c08066633023..22c4d87c9c15 100644
--- a/arch/um/drivers/virt-pci.c
+++ b/arch/um/drivers/virt-pci.c
@@ -616,7 +616,7 @@ static void um_pci_virtio_remove(struct virtio_device *vdev)
int i;
  
  /* Stop all virtqueues */

-vdev->config->reset(vdev);
+virtio_reset_device(vdev);
  vdev->config->del_vqs(vdev);


Nit: virtio_device_reset()?

Because I see:

int virtio_device_freeze(struct virtio_device *dev);
int virtio_device_restore(struct virtio_device *dev);
void virtio_device_ready(struct virtio_device *dev)

But well, there is:
void virtio_break_device(struct virtio_device *dev);

--
Thanks,

David / dhildenb



Re: [PATCH 1/2] dma-mapping: remove bogus test for pfn_valid from dma_map_resource

2021-09-30 Thread David Hildenbrand

On 30.09.21 03:30, Mike Rapoport wrote:

From: Mike Rapoport 

dma_map_resource() uses pfn_valid() to ensure the range is not RAM.
However, pfn_valid() only checks for availability of the memory map for a
PFN but it does not ensure that the PFN is actually backed by RAM.

As dma_map_resource() is the only method in DMA mapping APIs that has this
check, simply drop the pfn_valid() test from dma_map_resource().
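
(For context, the check being dropped is, roughly:)

	/* Don't allow RAM to be mapped as a resource. */
	if (WARN_ON_ONCE(pfn_valid(PHYS_PFN(phys_addr))))
		return DMA_MAPPING_ERROR;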

Link: https://lore.kernel.org/all/20210824173741.gc...@arm.com/
Signed-off-by: Mike Rapoport 
---


Acked-by: David Hildenbrand 

--
Thanks,

David / dhildenb



Re: An cma optimization patch is used for cma_[alloc|free].

2021-08-13 Thread David Hildenbrand

On 13.08.21 09:00, Jichao Zou wrote:

Pre-allocate CMA memory that configured in device
tree, this greatly improves the CMA memory
allocation efficiency, cma_[alloc|free] is less
than 1ms, the old way took a few ms to tens or
hundreds of ms.



Please send patches as proper emails (man git-format-patch; man 
git-send-email).


What you propose is turning cma reservations into something comparable 
to permanent boottime allocations. From the POV of the buddy, the pages 
are always allocated and cannot be repurposed for e.g., movable 
allocations until *actually* allocated via CMA.


I don't think we want this behavior upstream.

--
Thanks,

David / dhildenb


Re: [PATCH v3] drivers: introduce and use WANT_DMA_CMA for soft dependencies on DMA_CMA

2021-04-12 Thread David Hildenbrand

On 12.04.21 15:12, Robin Murphy wrote:

On 2021-04-09 14:39, David Hildenbrand wrote:

On 09.04.21 15:35, Arnd Bergmann wrote:

On Fri, Apr 9, 2021 at 1:21 PM David Hildenbrand 
wrote:


Random drivers should not override a user configuration of core knobs
(e.g., CONFIG_DMA_CMA=n). Applicable drivers would like to use DMA_CMA,
which depends on CMA, if possible; however, these drivers also have to
tolerate if DMA_CMA is not available/functioning, for example, if no CMA
area for DMA_CMA use has been set up via "cma=X". In the worst case, the
driver cannot do its job properly in some configurations.

For example, commit 63f5677544b3 ("drm/etnaviv: select CMA and DMA_CMA if
available") documents
  While this is no build dependency, etnaviv will only work correctly
  on most systems if CMA and DMA_CMA are enabled. Select both options
  if available to avoid users ending up with a non-working GPU due to
  a lacking kernel config.
So etnaviv really wants to have DMA_CMA, however, can deal with some cases
where it is not available.

Let's introduce WANT_DMA_CMA and use it in most cases where drivers
select CMA/DMA_CMA, or depend on DMA_CMA (in a wrong way via CMA because
of recursive dependency issues).

We'll assume that any driver that selects DRM_GEM_CMA_HELPER or
DRM_KMS_CMA_HELPER would like to use DMA_CMA if possible.

With this change, distributions can disable CONFIG_CMA or
CONFIG_DMA_CMA, without it silently getting enabled again by random
drivers. Also, we'll now automatically try to enable both, CONFIG_CMA
and CONFIG_DMA_CMA if they are unspecified and any driver is around that
selects WANT_DMA_CMA -- also implicitly via DRM_GEM_CMA_HELPER or
DRM_KMS_CMA_HELPER.

For example, if any driver selects WANT_DMA_CMA and we do a
"make olddefconfig":

1. With "# CONFIG_CMA is not set" and no specification of
     "CONFIG_DMA_CMA"

-> CONFIG_DMA_CMA won't be part of .config

2. With no specification of CONFIG_CMA or CONFIG_DMA_CMA

Contiguous Memory Allocator (CMA) [Y/n/?] (NEW)
DMA Contiguous Memory Allocator (DMA_CMA) [Y/n/?] (NEW)

3. With "# CONFIG_CMA is not set" and "# CONFIG_DMA_CMA is not set"

-> CONFIG_DMA_CMA will be removed from .config

Note: drivers/remoteproc seems to be special; commit c51e882cd711
("remoteproc/davinci: Update Kconfig to depend on DMA_CMA") explains that
there is a real dependency to DMA_CMA for it to work; leave that dependency
in place and don't convert it to a soft dependency.


I don't think this dependency is fundamentally different from the others,
though davinci machines tend to have less memory than a lot of the
other machines, so it's more likely to fail without CMA.



I was also unsure - and Lucas had similar thoughts. If you want, I can
send a v4 also taking care of this.


TBH I think it should all just be removed. DMA_CMA is effectively an
internal feature of the DMA API, and drivers which simply use the DMA
API shouldn't really be trying to assume *how* things might be allocated
at runtime - CMA is hardly the only way. Platform-level assumptions
about the presence or not of IOMMUs, memory carveouts, etc., and whether
it even matters - e.g. a device with a tiny LCD may only need display
buffers which still fit in a regular MAX_ORDER allocation - could go in
platform-specific configs, but I really don't think they belong at the
generic subsystem level.

We already have various examples like I2S drivers that won't even probe
without a dmaengine provider being present, or host controller drivers
which are useless without their corresponding phy driver (and I'm
guessing you can probably also do higher-level things like include the
block layer but omit all filesystem drivers). I don't believe it's
Kconfig's job to try to guess whether a given configuration is *useful*,
only to enforce that's it's valid to build.


That would mean: if it's not a build-time dependency, don't mention it
in Kconfig.


If that were true, why do we have defaults modeled in Kconfig then?

IMHO, some part of Kconfig is to give you sane defaults.

--
Thanks,

David / dhildenb


Re: [PATCH v3] drivers: introduce and use WANT_DMA_CMA for soft dependencies on DMA_CMA

2021-04-09 Thread David Hildenbrand

On 09.04.21 15:35, Arnd Bergmann wrote:

On Fri, Apr 9, 2021 at 1:21 PM David Hildenbrand  wrote:


Random drivers should not override a user configuration of core knobs
(e.g., CONFIG_DMA_CMA=n). Applicable drivers would like to use DMA_CMA,
which depends on CMA, if possible; however, these drivers also have to
tolerate if DMA_CMA is not available/functioning, for example, if no CMA
area for DMA_CMA use has been set up via "cma=X". In the worst case, the
driver cannot do its job properly in some configurations.

For example, commit 63f5677544b3 ("drm/etnaviv: select CMA and DMA_CMA if
available") documents
 While this is no build dependency, etnaviv will only work correctly
 on most systems if CMA and DMA_CMA are enabled. Select both options
 if available to avoid users ending up with a non-working GPU due to
 a lacking kernel config.
So etnaviv really wants to have DMA_CMA, however, can deal with some cases
where it is not available.

Let's introduce WANT_DMA_CMA and use it in most cases where drivers
select CMA/DMA_CMA, or depend on DMA_CMA (in a wrong way via CMA because
of recursive dependency issues).

We'll assume that any driver that selects DRM_GEM_CMA_HELPER or
DRM_KMS_CMA_HELPER would like to use DMA_CMA if possible.

With this change, distributions can disable CONFIG_CMA or
CONFIG_DMA_CMA, without it silently getting enabled again by random
drivers. Also, we'll now automatically try to enable both, CONFIG_CMA
and CONFIG_DMA_CMA if they are unspecified and any driver is around that
selects WANT_DMA_CMA -- also implicitly via DRM_GEM_CMA_HELPER or
DRM_KMS_CMA_HELPER.

For example, if any driver selects WANT_DMA_CMA and we do a
"make olddefconfig":

1. With "# CONFIG_CMA is not set" and no specification of
"CONFIG_DMA_CMA"

-> CONFIG_DMA_CMA won't be part of .config

2. With no specification of CONFIG_CMA or CONFIG_DMA_CMA

Contiguous Memory Allocator (CMA) [Y/n/?] (NEW)
DMA Contiguous Memory Allocator (DMA_CMA) [Y/n/?] (NEW)

3. With "# CONFIG_CMA is not set" and "# CONFIG_DMA_CMA is not set"

-> CONFIG_DMA_CMA will be removed from .config

Note: drivers/remoteproc seems to be special; commit c51e882cd711
("remoteproc/davinci: Update Kconfig to depend on DMA_CMA") explains that
there is a real dependency to DMA_CMA for it to work; leave that dependency
in place and don't convert it to a soft dependency.


I don't think this dependency is fundamentally different from the others,
though davinci machines tend to have less memory than a lot of the
other machines, so it's more likely to fail without CMA.



I was also unsure - and Lucas had similar thoughts. If you want, I can 
send a v4 also taking care of this.


Thanks!


Regardless of this,

Reviewed-by: Arnd Bergmann 




--
Thanks,

David / dhildenb



[PATCH v3] drivers: introduce and use WANT_DMA_CMA for soft dependencies on DMA_CMA

2021-04-09 Thread David Hildenbrand
Random drivers should not override a user configuration of core knobs
(e.g., CONFIG_DMA_CMA=n). Applicable drivers would like to use DMA_CMA,
which depends on CMA, if possible; however, these drivers also have to
tolerate if DMA_CMA is not available/functioning, for example, if no CMA
area for DMA_CMA use has been set up via "cma=X". In the worst case, the
driver cannot do its job properly in some configurations.

For example, commit 63f5677544b3 ("drm/etnaviv: select CMA and DMA_CMA if
available") documents
While this is no build dependency, etnaviv will only work correctly
on most systems if CMA and DMA_CMA are enabled. Select both options
if available to avoid users ending up with a non-working GPU due to
a lacking kernel config.
So etnaviv really wants to have DMA_CMA, however, can deal with some cases
where it is not available.

Let's introduce WANT_DMA_CMA and use it in most cases where drivers
select CMA/DMA_CMA, or depend on DMA_CMA (in a wrong way via CMA because
of recursive dependency issues).

We'll assume that any driver that selects DRM_GEM_CMA_HELPER or
DRM_KMS_CMA_HELPER would like to use DMA_CMA if possible.

With this change, distributions can disable CONFIG_CMA or
CONFIG_DMA_CMA, without it silently getting enabled again by random
drivers. Also, we'll now automatically try to enable both, CONFIG_CMA
and CONFIG_DMA_CMA if they are unspecified and any driver is around that
selects WANT_DMA_CMA -- also implicitly via DRM_GEM_CMA_HELPER or
DRM_KMS_CMA_HELPER.

For example, if any driver selects WANT_DMA_CMA and we do a
"make olddefconfig":

1. With "# CONFIG_CMA is not set" and no specification of
   "CONFIG_DMA_CMA"

-> CONFIG_DMA_CMA won't be part of .config

2. With no specification of CONFIG_CMA or CONFIG_DMA_CMA

Contiguous Memory Allocator (CMA) [Y/n/?] (NEW)
DMA Contiguous Memory Allocator (DMA_CMA) [Y/n/?] (NEW)

3. With "# CONFIG_CMA is not set" and "# CONFIG_DMA_CMA is not set"

-> CONFIG_DMA_CMA will be removed from .config

Note: drivers/remoteproc seems to be special; commit c51e882cd711
("remoteproc/davinci: Update Kconfig to depend on DMA_CMA") explains that
there is a real dependency to DMA_CMA for it to work; leave that dependency
in place and don't convert it to a soft dependency.

Cc: Maarten Lankhorst 
Cc: Maxime Ripard 
Cc: Thomas Zimmermann 
Cc: David Airlie 
Cc: Daniel Vetter 
Cc: Joel Stanley 
Cc: Andrew Jeffery 
Cc: Lucas Stach 
Cc: Russell King 
Cc: Christian Gmeiner 
Cc: Paul Cercueil 
Cc: Linus Walleij 
Cc: Christoph Hellwig 
Cc: Marek Szyprowski 
Cc: Robin Murphy 
Cc: Andrew Morton 
Cc: Mike Rapoport 
Cc: Arnd Bergmann 
Cc: Bartlomiej Zolnierkiewicz 
Cc: Eric Anholt 
Cc: Michal Simek 
Cc: Masahiro Yamada 
Cc: "Alexander A. Klimov" 
Cc: Peter Collingbourne 
Cc: Suman Anna 
Cc: Jason Gunthorpe 
Cc: dri-de...@lists.freedesktop.org
Cc: linux-asp...@lists.ozlabs.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: etna...@lists.freedesktop.org
Cc: linux-m...@vger.kernel.org
Cc: linux-fb...@vger.kernel.org
Cc: iommu@lists.linux-foundation.org
Signed-off-by: David Hildenbrand 
---

Let's see if this approach is better for soft dependencies (and if we
actually have some hard dependencies in there). This is the follow-up
of
  https://lkml.kernel.org/r/20210408092011.52763-1-da...@redhat.com
  https://lkml.kernel.org/r/20210408100523.63356-1-da...@redhat.com

I was wondering if it would make sense in some drivers to warn if either
CONFIG_DMA_CMA is not available or if DRM_CMA has not been configured
properly - just to give people a heads up that something might more likely
go wrong; that would, however, be future work.

v2 -> v3:
- Don't use "imply" but instead use a new WANT_DMA_CMA and make the default
  of CMA and DMA_CMA depend on it.
- Also adjust ingenic, mcde, tve200; these sound like soft dependencies as
  well (although DMA_CMA is really desired)

v1 -> v2:
- Fix DRM_CMA -> DMA_CMA

---
 drivers/gpu/drm/Kconfig | 2 ++
 drivers/gpu/drm/aspeed/Kconfig  | 2 --
 drivers/gpu/drm/etnaviv/Kconfig | 3 +--
 drivers/gpu/drm/ingenic/Kconfig | 1 -
 drivers/gpu/drm/mcde/Kconfig| 1 -
 drivers/gpu/drm/tve200/Kconfig  | 1 -
 drivers/video/fbdev/Kconfig | 2 +-
 kernel/dma/Kconfig  | 7 +++
 mm/Kconfig  | 1 +
 9 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
index 85b79a7fee63..6f9989adfa93 100644
--- a/drivers/gpu/drm/Kconfig
+++ b/drivers/gpu/drm/Kconfig
@@ -201,12 +201,14 @@ config DRM_TTM_HELPER
 config DRM_GEM_CMA_HELPER
bool
depends on DRM
+   select WANT_DMA_CMA
help
  Choose this if you need the GEM CMA helper functions
 
 config DRM_KMS_CMA_HELPER
bool
depends on DRM
+   select WANT_DMA_CMA
select DRM_GEM_CMA_HELPER
help
  Choose this if

Re: [PATCH 0/3] mm/page_alloc: Fix pageblock_order with HUGETLB_PAGE_SIZE_VARIABLE

2021-02-16 Thread David Hildenbrand



But again, if there are valid use cases then sure, let's make the code fully 
compatible with HUGETLB_PAGE_ORDER > MAX_ORDER.


Given that gigantic HugeTLB allocation can fall back on alloc_contig_pages()
or CMA if/when available, is there a real need for HUGETLB_PAGE_ORDER to be
up to MAX_ORDER, used as pageblock_order or as COMPACTION_HPAGE_ORDER? With
gigantic HugeTLB pages being available, HUGETLB_PAGE_ORDER seems to be just
detached from the buddy allocator. But I am not sure; will keep looking.


Having HPAGE_PMD_ORDER >> MAX_ORDER ("significantly larger") will make
it very unlikely that you are able to reliably allocate any huge pages
at all dynamically at runtime without CMA.
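
For context, a minimal sketch of the runtime gigantic page allocation path
being discussed -- roughly what gigantic HugeTLB allocation does via
alloc_contig_pages(); names and error handling are simplified here:

/* sketch: runtime allocation of a gigantic page without CMA */
#include <linux/gfp.h>

static struct page *alloc_gigantic_sketch(int nid, unsigned int order)
{
	unsigned long nr_pages = 1UL << order;	/* e.g. order 18 for 1G with 4K pages */

	/*
	 * alloc_contig_pages() searches zones for a suitable range and
	 * migrates pages out of it; without CMA this becomes less
	 * reliable the longer the system runs (fragmentation, pinning).
	 */
	return alloc_contig_pages(nr_pages, GFP_KERNEL | __GFP_THISNODE,
				  nid, NULL);
}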


Gigantic pages are problematic by nature :)

--
Thanks,

David / dhildenb



Re: [PATCH 0/3] mm/page_alloc: Fix pageblock_order with HUGETLB_PAGE_SIZE_VARIABLE

2021-02-12 Thread David Hildenbrand

On 12.02.21 08:02, Anshuman Khandual wrote:


On 2/11/21 2:07 PM, David Hildenbrand wrote:

On 11.02.21 07:22, Anshuman Khandual wrote:

The following warning gets triggered while trying to boot a kernel built
with 64K page size and without THP on an arm64 platform.

WARNING: CPU: 5 PID: 124 at mm/vmstat.c:1080 __fragmentation_index+0xa4/0xc0
Modules linked in:
CPU: 5 PID: 124 Comm: kswapd0 Not tainted 5.11.0-rc6-4-ga0ea7d62002 #159
Hardware name: linux,dummy-virt (DT)
[    8.810673] pstate: 2045 (nzCv daif +PAN -UAO -TCO BTYPE=--)
[    8.811732] pc : __fragmentation_index+0xa4/0xc0
[    8.812555] lr : fragmentation_index+0xf8/0x138
[    8.813360] sp : 864079b0
[    8.813958] x29: 864079b0 x28: 0372
[    8.814901] x27: 7682 x26: 8000135b3948
[    8.815847] x25: 1fffe00010c80f48 x24: 
[    8.816805] x23:  x22: 000d
[    8.817764] x21: 0030 x20: 0005ffcb4d58
[    8.818712] x19: 000b x18: 
[    8.819656] x17:  x16: 
[    8.820613] x15:  x14: 8000114c6258
[    8.821560] x13: 6000bff969ba x12: 1fffe000bff969b9
[    8.822514] x11: 1fffe000bff969b9 x10: 6000bff969b9
[    8.823461] x9 : dfff8000 x8 : 0005ffcb4dcf
[    8.824415] x7 : 0001 x6 : 41b58ab3
[    8.825359] x5 : 600010c80f48 x4 : dfff8000
[    8.826313] x3 : 8000102be670 x2 : 0007
[    8.827259] x1 : 86407a60 x0 : 000d
[    8.828218] Call trace:
[    8.828667]  __fragmentation_index+0xa4/0xc0
[    8.829436]  fragmentation_index+0xf8/0x138
[    8.830194]  compaction_suitable+0x98/0xb8
[    8.830934]  wakeup_kcompactd+0xdc/0x128
[    8.831640]  balance_pgdat+0x71c/0x7a0
[    8.832327]  kswapd+0x31c/0x520
[    8.832902]  kthread+0x224/0x230
[    8.833491]  ret_from_fork+0x10/0x30
[    8.834150] ---[ end trace 472836f79c15516b ]---

This warning comes from __fragmentation_index() when the requested order
is greater than MAX_ORDER.

static int __fragmentation_index(unsigned int order,
  struct contig_page_info *info)
{
  unsigned long requested = 1UL << order;

  if (WARN_ON_ONCE(order >= MAX_ORDER)) <= Triggered here
  return 0;

Digging further reveals that pageblock_order has been assigned a value
greater than MAX_ORDER, failing the above check. But why did this
happen? Because HUGETLB_PAGE_ORDER for the given config on arm64 is
greater than MAX_ORDER.

The solution involves enabling HUGETLB_PAGE_SIZE_VARIABLE, which would make
pageblock_order a variable instead of the constant HUGETLB_PAGE_ORDER. But
that change alone did not really work, as pageblock_order still got assigned
HUGETLB_PAGE_ORDER in set_pageblock_order(). HUGETLB_PAGE_ORDER needs to
be less than MAX_ORDER to be appropriate as pageblock_order; otherwise,
just fall back to MAX_ORDER - 1 as before. While here, it also fixes a build
problem by type casting MAX_ORDER in rmem_cma_setup().
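
To make the numbers concrete, a sketch of the arm64 arithmetic behind this,
assuming the default FORCE_MAX_ZONEORDER of 11 without THP:

/* sketch: why HUGETLB_PAGE_ORDER exceeds MAX_ORDER on arm64 with 64K pages */
#define PAGE_SHIFT		16	/* 64K base pages */
#define PMD_SHIFT		(PAGE_SHIFT + (PAGE_SHIFT - 3))	/* 29 -> 512M PMD */
#define HUGETLB_PAGE_ORDER	(PMD_SHIFT - PAGE_SHIFT)	/* 13 */
#define MAX_ORDER		11	/* FORCE_MAX_ZONEORDER without THP */
/*
 * pageblock_order ends up as HUGETLB_PAGE_ORDER (13) >= MAX_ORDER (11),
 * so compaction eventually calls __fragmentation_index() with an order
 * that trips the WARN_ON_ONCE() shown above.
 */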


I'm wondering, is there any real value in allowing FORCE_MAX_ZONEORDER to be 
"11" with ARM64_64K_PAGES/ARM64_16K_PAGES?


MAX_ORDER should be as high as would be required for the current config.
Unless THP is enabled, there is no need for it to be any higher than 11.
But I might be missing historical reasons around this as well. Probably
others from arm64 could help here.


Theoretically yes, practically no. If nobody cares about a 
configuration, no need to make the code more complicated for that 
configuration.


Meaning: are there any real use cases that actually build a kernel without 
TRANSPARENT_HUGEPAGE and with ARM64_64K_PAGES/ARM64_16K_PAGES?


THP is always optional. Besides, kernel builds without THP should always
be supported. Assuming that all builds will have THP enabled might not
be accurate.



As builds are essentially broken, I assume this is not that relevant? Or how 
long has it been broken?


Git blame shows that it's been there for some time now. But how does
that make this irrelevant? A problem should be fixed nonetheless.


When exactly did I say not to fix it? I'm saying if nobody uses it, we 
might be able to simplify.


It might be easier to just drop the "TRANSPARENT_HUGEPAGE" part from the 
FORCE_MAX_ZONEORDER config.



Not sure if it would be a good idea to unnecessarily have a larger
MAX_ORDER value for a given config. But I might be missing other context here.


My point is: keep it simple if there is no need to make it complicated. 
If these arm64 variants are the only cases where we run into that issue 
and nobody uses them ("that it's been there for some time now"), why make 
stuff complicated?


The current code seems to assume that HUGETLB_PAGE_ORDER <= MAX_ORDER. 
Instead of changing that for optimizing an unused use case (it is 
broken), just simplify the arm64 conditions.

Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin

2021-02-11 Thread David Hildenbrand


Again in proper SVA it should be quite unlikely to take a fault caused
by something like migration, with the same likelihood as the CPU. If
things are faulting so much this is a problem then I think it is a
system level problem with doing too much page motion.


My point is that a single SVA application shouldn't require the system
to make global changes, such as disabling NUMA balancing or disabling
THP, to decrease page fault frequency at the cost of other applications.

Anyway, folks are away for Lunar New Year. Hopefully we will get more
real benchmark data afterwards to make the discussion more targeted.


Right, but I think functionality as proposed in this patch is highly 
unlikely to make it into the kernel. I'd be interested in approaches to 
mitigate this per process. E.g., temporarily stop the kernel from 
messing with THP of this specific process.
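
As one existing per-process knob in that direction, userspace can already
opt a single process out of THP via prctl() -- a minimal sketch:

/* sketch: per-process THP opt-out, inherited by children */
#include <stdio.h>
#include <sys/prctl.h>

int main(void)
{
	if (prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0))
		perror("prctl(PR_SET_THP_DISABLE)");
	/* ... run the SVA workload without THP interference ... */
	return 0;
}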


But even then, why should some random library make such decisions for a 
whole process? Just as, why should some random library pin pages never 
allocated by it and stop THP from being created or NUMA layout from 
getting optimized? This looks like a clear layer violation to me.


I fully agree with Jason: Why do the events happen that often such that 
your use cases are affected that heavily, such that we even need such 
ugly handling?


What mempinfd does is expose, via a syscall available to generic users,
dangerous functionality that we don't want 99.6% of all user space to
ever use -- just to fix broken* hw.


*broken might be over-stressing the situation, but the HW (SVA) 
obviously seems to perform way worse than ordinary CPUs.


--
Thanks,

David / dhildenb



Re: [PATCH 0/3] mm/page_alloc: Fix pageblock_order with HUGETLB_PAGE_SIZE_VARIABLE

2021-02-11 Thread David Hildenbrand

On 11.02.21 07:22, Anshuman Khandual wrote:

The following warning gets triggered while trying to boot a kernel built
with 64K page size and without THP on an arm64 platform.

WARNING: CPU: 5 PID: 124 at mm/vmstat.c:1080 __fragmentation_index+0xa4/0xc0
Modules linked in:
CPU: 5 PID: 124 Comm: kswapd0 Not tainted 5.11.0-rc6-4-ga0ea7d62002 #159
Hardware name: linux,dummy-virt (DT)
[8.810673] pstate: 2045 (nzCv daif +PAN -UAO -TCO BTYPE=--)
[8.811732] pc : __fragmentation_index+0xa4/0xc0
[8.812555] lr : fragmentation_index+0xf8/0x138
[8.813360] sp : 864079b0
[8.813958] x29: 864079b0 x28: 0372
[8.814901] x27: 7682 x26: 8000135b3948
[8.815847] x25: 1fffe00010c80f48 x24: 
[8.816805] x23:  x22: 000d
[8.817764] x21: 0030 x20: 0005ffcb4d58
[8.818712] x19: 000b x18: 
[8.819656] x17:  x16: 
[8.820613] x15:  x14: 8000114c6258
[8.821560] x13: 6000bff969ba x12: 1fffe000bff969b9
[8.822514] x11: 1fffe000bff969b9 x10: 6000bff969b9
[8.823461] x9 : dfff8000 x8 : 0005ffcb4dcf
[8.824415] x7 : 0001 x6 : 41b58ab3
[8.825359] x5 : 600010c80f48 x4 : dfff8000
[8.826313] x3 : 8000102be670 x2 : 0007
[8.827259] x1 : 86407a60 x0 : 000d
[8.828218] Call trace:
[8.828667]  __fragmentation_index+0xa4/0xc0
[8.829436]  fragmentation_index+0xf8/0x138
[8.830194]  compaction_suitable+0x98/0xb8
[8.830934]  wakeup_kcompactd+0xdc/0x128
[8.831640]  balance_pgdat+0x71c/0x7a0
[8.832327]  kswapd+0x31c/0x520
[8.832902]  kthread+0x224/0x230
[8.833491]  ret_from_fork+0x10/0x30
[8.834150] ---[ end trace 472836f79c15516b ]---

This warning comes from __fragmentation_index() when the requested order
is greater than MAX_ORDER.

static int __fragmentation_index(unsigned int order,
 struct contig_page_info *info)
{
 unsigned long requested = 1UL << order;

 if (WARN_ON_ONCE(order >= MAX_ORDER)) <= Triggered here
 return 0;

Digging further reveals that pageblock_order has been assigned a value
greater than MAX_ORDER, failing the above check. But why did this
happen? Because HUGETLB_PAGE_ORDER for the given config on arm64 is
greater than MAX_ORDER.

The solution involves enabling HUGETLB_PAGE_SIZE_VARIABLE, which would make
pageblock_order a variable instead of the constant HUGETLB_PAGE_ORDER. But
that change alone did not really work, as pageblock_order still got assigned
HUGETLB_PAGE_ORDER in set_pageblock_order(). HUGETLB_PAGE_ORDER needs to
be less than MAX_ORDER to be appropriate as pageblock_order; otherwise,
just fall back to MAX_ORDER - 1 as before. While here, it also fixes a build
problem by type casting MAX_ORDER in rmem_cma_setup().


I'm wondering, is there any real value in allowing FORCE_MAX_ZONEORDER 
to be "11" with ARM64_64K_PAGES/ARM64_16K_PAGES?


Meaning: are there any real use cases that actually build a kernel 
without TRANSPARENT_HUGEPAGE and with ARM64_64K_PAGES/ARM64_16K_PAGES?


As builds are essentially broken, I assume this is not that relevant? Or 
how long has it been broken?


It might be easier to just drop the "TRANSPARENT_HUGEPAGE" part from the 
FORCE_MAX_ZONEORDER config.


--
Thanks,

David / dhildenb



Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin

2021-02-08 Thread David Hildenbrand

On 08.02.21 11:13, Song Bao Hua (Barry Song) wrote:

-Original Message-
From: owner-linux...@kvack.org [mailto:owner-linux...@kvack.org] On Behalf Of
David Hildenbrand
Sent: Monday, February 8, 2021 9:22 PM
To: Song Bao Hua (Barry Song) ; Matthew Wilcox

Cc: Wangzhou (B) ; linux-ker...@vger.kernel.org;
iommu@lists.linux-foundation.org; linux...@kvack.org;
linux-arm-ker...@lists.infradead.org; linux-...@vger.kernel.org; Andrew
Morton ; Alexander Viro ;
gre...@linuxfoundation.org; j...@ziepe.ca; kevin.t...@intel.com;
jean-phili...@linaro.org; eric.au...@redhat.com; Liguozhu (Kenneth)
; zhangfei@linaro.org; chensihang (A)

Subject: Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory
pin

On 08.02.21 03:27, Song Bao Hua (Barry Song) wrote:

-Original Message-
From: owner-linux...@kvack.org [mailto:owner-linux...@kvack.org] On Behalf

Of

Matthew Wilcox
Sent: Monday, February 8, 2021 2:31 PM
To: Song Bao Hua (Barry Song) 
Cc: Wangzhou (B) ; linux-ker...@vger.kernel.org;
iommu@lists.linux-foundation.org; linux...@kvack.org;
linux-arm-ker...@lists.infradead.org; linux-...@vger.kernel.org; Andrew
Morton ; Alexander Viro

;

gre...@linuxfoundation.org; j...@ziepe.ca; kevin.t...@intel.com;
jean-phili...@linaro.org; eric.au...@redhat.com; Liguozhu (Kenneth)
; zhangfei@linaro.org; chensihang (A)

Subject: Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory
pin

On Sun, Feb 07, 2021 at 10:24:28PM +, Song Bao Hua (Barry Song) wrote:

In high-performance I/O cases, accelerators might want to perform
I/O on memory without IO page faults, which can result in dramatically
increased latency. Current memory-related APIs cannot achieve this
requirement; e.g., mlock can only prevent memory from being swapped out
to a backing device, and page migration can still trigger IO page faults.


Well ... we have two requirements.  The application wants to not take
page faults.  The system wants to move the application to a different
NUMA node in order to optimise overall performance.  Why should the
application's desires take precedence over the kernel's desires?  And why
should it be done this way rather than by the sysadmin using numactl to
lock the application to a particular node?


The NUMA balancer is just one of many reasons for page migration. Even a
simple alloc_pages() can cause memory migration within a single NUMA
node or on a UMA system.

The other reasons for page migration include, but are not limited to:
* memory moves due to CMA
* memory moves due to huge page creation

We can hardly ask users to disable compaction, CMA and huge pages
in the whole system.


You're dodging the question.  Should the CMA allocation fail because
another application is using SVA?

I would say no.


I would say no as well.

While an IOMMU is enabled, CMA almost has only one user: the IOMMU driver,
as other drivers will depend on the IOMMU to use non-contiguous memory,
though they are still calling dma_alloc_coherent().

In the IOMMU driver, dma_alloc_coherent is called during initialization
and there is no new allocation afterwards. So it wouldn't cause a runtime
impact on SVA performance. Even if there are new allocations, CMA will
fall back to general alloc_pages(), and IOMMU drivers mostly allocate
small memory for command queues.

So I would say general compound pages, huge pages, and especially
transparent huge pages would be bigger concerns than CMA for
internal page migration within one NUMA node.

Unlike CMA, general alloc_pages() can get memory by moving
pages other than those pinned.

And there is no guarantee we can always bind the memory of
SVA applications to a single NUMA node, so NUMA balancing is
still a concern.

But I agree we need a way to make CMA succeed while the userspace
pages are pinned. Since pinning has been viral in many drivers, I
assume there is a way to handle this. Otherwise, APIs like
V4L2_MEMORY_USERPTR[1] will possibly make CMA fail, as there is
no guarantee that userspace will allocate unmovable memory and
no guarantee that the fallback path - alloc_pages() - can
succeed while allocating big memory.



Long term pinnings cannot go onto CMA-reserved memory, and there is
similar work to also fix ZONE_MOVABLE in that regard.

https://lkml.kernel.org/r/20210125194751.1275316-1-pasha.tatashin@soleen.com

One of the reasons I detest using long term pinning of pages where it
could be avoided. Take VFIO and RDMA as an example: these things
currently can't work without them.

What I read here: "DMA performance will be affected severely". That does
not sound like a compelling argument to me for long term pinnings.
Please find another way to achieve the same goal without long term
pinnings controlled by user space - e.g., controlling when migration
actually happens.

For example, CMA/alloc_contig_range()/memory unplug are corner cases
that happen rarely, you shouldn't have to worry about them messing with
your DMA performance.


I agree CMA/alloc_contig_range()/memory unplug would be corner cases,

Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin

2021-02-08 Thread David Hildenbrand

On 08.02.21 03:27, Song Bao Hua (Barry Song) wrote:

-Original Message-
From: owner-linux...@kvack.org [mailto:owner-linux...@kvack.org] On Behalf Of
Matthew Wilcox
Sent: Monday, February 8, 2021 2:31 PM
To: Song Bao Hua (Barry Song) 
Cc: Wangzhou (B) ; linux-ker...@vger.kernel.org;
iommu@lists.linux-foundation.org; linux...@kvack.org;
linux-arm-ker...@lists.infradead.org; linux-...@vger.kernel.org; Andrew
Morton ; Alexander Viro ;
gre...@linuxfoundation.org; j...@ziepe.ca; kevin.t...@intel.com;
jean-phili...@linaro.org; eric.au...@redhat.com; Liguozhu (Kenneth)
; zhangfei@linaro.org; chensihang (A)

Subject: Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory
pin

On Sun, Feb 07, 2021 at 10:24:28PM +, Song Bao Hua (Barry Song) wrote:

In high-performance I/O cases, accelerators might want to perform
I/O on memory without IO page faults, which can result in dramatically
increased latency. Current memory-related APIs cannot achieve this
requirement; e.g., mlock can only prevent memory from being swapped out
to a backing device, and page migration can still trigger IO page faults.


Well ... we have two requirements.  The application wants to not take
page faults.  The system wants to move the application to a different
NUMA node in order to optimise overall performance.  Why should the
application's desires take precedence over the kernel's desires?  And why
should it be done this way rather than by the sysadmin using numactl to
lock the application to a particular node?


The NUMA balancer is just one of many reasons for page migration. Even a
simple alloc_pages() can cause memory migration within a single NUMA
node or on a UMA system.

The other reasons for page migration include, but are not limited to:
* memory moves due to CMA
* memory moves due to huge page creation

We can hardly ask users to disable compaction, CMA and huge pages
in the whole system.


You're dodging the question.  Should the CMA allocation fail because
another application is using SVA?

I would say no.


I would say no as well.

While an IOMMU is enabled, CMA almost has only one user: the IOMMU driver,
as other drivers will depend on the IOMMU to use non-contiguous memory,
though they are still calling dma_alloc_coherent().

In the IOMMU driver, dma_alloc_coherent is called during initialization
and there is no new allocation afterwards. So it wouldn't cause a runtime
impact on SVA performance. Even if there are new allocations, CMA will
fall back to general alloc_pages(), and IOMMU drivers mostly allocate
small memory for command queues.

So I would say general compound pages, huge pages, and especially
transparent huge pages would be bigger concerns than CMA for
internal page migration within one NUMA node.

Unlike CMA, general alloc_pages() can get memory by moving
pages other than those pinned.

And there is no guarantee we can always bind the memory of
SVA applications to a single NUMA node, so NUMA balancing is
still a concern.

But I agree we need a way to make CMA succeed while the userspace
pages are pinned. Since pinning has been viral in many drivers, I
assume there is a way to handle this. Otherwise, APIs like
V4L2_MEMORY_USERPTR[1] will possibly make CMA fail, as there is
no guarantee that userspace will allocate unmovable memory and
no guarantee that the fallback path - alloc_pages() - can
succeed while allocating big memory.



Long term pinnings cannot go onto CMA-reserved memory, and there is 
similar work to also fix ZONE_MOVABLE in that regard.


https://lkml.kernel.org/r/20210125194751.1275316-1-pasha.tatas...@soleen.com

One of the reasons I detest using long term pinning of pages where it 
could be avoided. Take VFIO and RDMA as an example: these things 
currently can't work without them.


What I read here: "DMA performance will be affected severely". That does 
not sound like a compelling argument to me for long term pinnings. 
Please find another way to achieve the same goal without long term 
pinnings controlled by user space - e.g., controlling when migration 
actually happens.


For example, CMA/alloc_contig_range()/memory unplug are corner cases 
that happen rarely, you shouldn't have to worry about them messing with 
your DMA performance.


--
Thanks,

David / dhildenb



Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin

2021-02-08 Thread David Hildenbrand

On 07.02.21 09:18, Zhou Wang wrote:

SVA (shared virtual addressing) offers a way for a device to share a
process' virtual address space safely, which makes user-space device
driver coding more convenient. However, IO page faults may happen when
doing DMA operations. As the latency of an IO page fault is relatively
high, DMA performance will be affected severely when there are IO page
faults. From a long-term view, DMA performance will not be stable.

In high-performance I/O cases, accelerators might want to perform
I/O on memory without IO page faults, which can result in dramatically
increased latency. Current memory-related APIs cannot achieve this
requirement; e.g., mlock can only prevent memory from being swapped out
to a backing device, and page migration can still trigger IO page faults.

Various drivers working in traditional non-SVA mode use their own
specific ioctls to do pinning. Such ioctls can be seen in v4l2, gpu,
infiniband, media, vfio, etc. Drivers usually do DMA mapping while
pinning.

But in SVA mode, pinning could be a common need which isn't necessarily
bound to any driver, and neither is DMA mapping needed by drivers,
since devices use the CPU's virtual addresses. Thus, it is better
to introduce a new common syscall for it.

This patch leverages the design of userfaultfd and adds mempinfd for
pinning, to avoid messing up mm_struct. An fd is obtained via mempinfd;
user space can then pin/unpin pages via ioctls on this fd, and all pages
pinned under one file are unpinned when the file is released. Like pin
page cases in other places, can_do_mlock is used to check permissions and
input parameters.
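
To illustrate the proposed flow, a hypothetical userspace sketch; the
struct layout, ioctl numbers and names below are illustrative assumptions,
not necessarily what this RFC's uapi header defines:

/* hypothetical usage of the proposed mempinfd interface */
#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

struct pin_range { uint64_t addr; uint64_t size; };	/* assumed uapi */
#define MEMPIN_CMD_PIN	 _IOW('m', 0, struct pin_range)	/* assumed */
#define MEMPIN_CMD_UNPIN _IOW('m', 1, struct pin_range)	/* assumed */
#define __NR_mempinfd	 442	/* per this RFC's unistd changes */

int main(void)
{
	size_t len = 1 << 20;
	void *buf = malloc(len);
	struct pin_range r = { (uint64_t)(uintptr_t)buf, len };
	int fd = syscall(__NR_mempinfd);

	ioctl(fd, MEMPIN_CMD_PIN, &r);	/* pages stay put for SVA DMA */
	/* ... device does DMA on buf without IO page faults ... */
	close(fd);			/* release unpins everything */
	return 0;
}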

Signed-off-by: Zhou Wang 
Signed-off-by: Sihang Chen 
Suggested-by: Barry Song 
---
  arch/arm64/include/asm/unistd.h   |   2 +-
  arch/arm64/include/asm/unistd32.h |   2 +
  fs/Makefile   |   1 +
  fs/mempinfd.c | 199 ++
  include/linux/syscalls.h  |   1 +
  include/uapi/asm-generic/unistd.h |   4 +-
  include/uapi/linux/mempinfd.h |  23 +
  init/Kconfig  |   6 ++
  8 files changed, 236 insertions(+), 2 deletions(-)
  create mode 100644 fs/mempinfd.c
  create mode 100644 include/uapi/linux/mempinfd.h

diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
index 86a9d7b3..949788f 100644
--- a/arch/arm64/include/asm/unistd.h
+++ b/arch/arm64/include/asm/unistd.h
@@ -38,7 +38,7 @@
  #define __ARM_NR_compat_set_tls   (__ARM_NR_COMPAT_BASE + 5)
  #define __ARM_NR_COMPAT_END   (__ARM_NR_COMPAT_BASE + 0x800)
  
-#define __NR_compat_syscalls		442

+#define __NR_compat_syscalls   443
  #endif
  
  #define __ARCH_WANT_SYS_CLONE

diff --git a/arch/arm64/include/asm/unistd32.h 
b/arch/arm64/include/asm/unistd32.h
index cccfbbe..3f49529 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -891,6 +891,8 @@ __SYSCALL(__NR_faccessat2, sys_faccessat2)
  __SYSCALL(__NR_process_madvise, sys_process_madvise)
  #define __NR_epoll_pwait2 441
  __SYSCALL(__NR_epoll_pwait2, compat_sys_epoll_pwait2)
+#define __NR_mempinfd 442
+__SYSCALL(__NR_mempinfd, sys_mempinfd)
  
  /*

   * Please add new compat syscalls above this comment and update
diff --git a/fs/Makefile b/fs/Makefile
index 999d1a2..e1cbf12 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -54,6 +54,7 @@ obj-$(CONFIG_COREDUMP)+= coredump.o
  obj-$(CONFIG_SYSCTL)  += drop_caches.o
  
  obj-$(CONFIG_FHANDLE)		+= fhandle.o

+obj-$(CONFIG_MEMPINFD) += mempinfd.o
  obj-y += iomap/
  
  obj-y+= quota/

diff --git a/fs/mempinfd.c b/fs/mempinfd.c
new file mode 100644
index 000..23d3911
--- /dev/null
+++ b/fs/mempinfd.c
@@ -0,0 +1,199 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (c) 2021 HiSilicon Limited. */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+struct mem_pin_container {
+   struct xarray array;
+   struct mutex lock;
+};
+
+struct pin_pages {
+   unsigned long first;
+   unsigned long nr_pages;
+   struct page **pages;
+};
+
+static int mempinfd_release(struct inode *inode, struct file *file)
+{
+   struct mem_pin_container *priv = file->private_data;
+   struct pin_pages *p;
+   unsigned long idx;
+
+   xa_for_each(&priv->array, idx, p) {
+   unpin_user_pages(p->pages, p->nr_pages);
+   xa_erase(&priv->array, p->first);
+   vfree(p->pages);
+   kfree(p);
+   }
+
+   mutex_destroy(&priv->lock);
+   xa_destroy(&priv->array);
+   kfree(priv);
+
+   return 0;
+}
+
+static int mempinfd_input_check(u64 addr, u64 size)
+{
+   if (!size || addr + size < addr)
+   return -EINVAL;
+
+   if (!can_do_mlock())
+   return -EPERM;
+
+   return 0;
+}
+
+static int mem_pin_page(struct mem_pin_container *priv,
+   struct 

Re: [RFC 2/3] arm64/hugetlb: Enable HUGETLB_PAGE_SIZE_VARIABLE

2021-02-05 Thread David Hildenbrand

On 04.02.21 08:01, Anshuman Khandual wrote:

MAX_ORDER, which invariably depends on FORCE_MAX_ZONEORDER, can be a variable
for a given page size, depending on whether TRANSPARENT_HUGEPAGE is enabled
or not. In certain page size and THP combinations HUGETLB_PAGE_ORDER can be
greater than MAX_ORDER, making it unusable as pageblock_order.


Just so I understand correctly, this does not imply that we have THP 
that exceed the pageblock size / MAX_ORDER size, correct?


This enables HUGETLB_PAGE_SIZE_VARIABLE, making pageblock_order a variable
rather than the compile-time constant HUGETLB_PAGE_ORDER, which could break
the MAX_ORDER rule for certain configurations.

Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-ker...@vger.kernel.org
Signed-off-by: Anshuman Khandual 
---
  arch/arm64/Kconfig | 4 
  1 file changed, 4 insertions(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 175914f2f340..c4acf8230f20 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -1918,6 +1918,10 @@ config ARCH_ENABLE_THP_MIGRATION
def_bool y
depends on TRANSPARENT_HUGEPAGE
  
+config HUGETLB_PAGE_SIZE_VARIABLE

+   def_bool y
+   depends on HUGETLB_PAGE
+
  menu "Power management options"
  
  source "kernel/power/Kconfig"

--
Thanks,

David / dhildenb



Re: [PATCH v2] iommu/mediatek: check 4GB mode by reading infracfg

2020-07-21 Thread David Hildenbrand
On 21.07.20 04:16, Miles Chen wrote:
> In previous discussion [1] and [2], we found that it is risky to
> use max_pfn or totalram_pages to tell if 4GB mode is enabled.
> 
> Check 4GB mode by reading infracfg register, remove the usage
> of the un-exported symbol max_pfn.
> 
> This is a step towards building mtk_iommu as a kernel module.
> 
> Changes since v1:
> 1. remove the phandle usage, search for infracfg instead [3]
> 2. use infracfg instead of infracfg_regmap
> 3. move infracfg definitions to linux/soc/mediatek/infracfg.h
> 4. update enable_4GB only when has_4gb_mode

Nit: We tend to place such information below the "---" (adding a second
one) such that this information is discarded when the patch is picked up.

> 
> [1] https://lkml.org/lkml/2020/6/3/733
> [2] https://lkml.org/lkml/2020/6/4/136
> [3] https://lkml.org/lkml/2020/7/15/1147
> 
> Cc: Mike Rapoport 
> Cc: David Hildenbrand 
> Cc: Yong Wu 
> Cc: Yingjoe Chen 
> Cc: Christoph Hellwig 
> Cc: Yong Wu 
> Cc: Chao Hao 
> Cc: Rob Herring 
> Cc: Matthias Brugger 
> Signed-off-by: Miles Chen 
> ---
>  drivers/iommu/mtk_iommu.c | 26 +-
>  include/linux/soc/mediatek/infracfg.h |  3 +++
>  2 files changed, 24 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/iommu/mtk_iommu.c b/drivers/iommu/mtk_iommu.c
> index 2be96f1cdbd2..16765f532853 100644
> --- a/drivers/iommu/mtk_iommu.c
> +++ b/drivers/iommu/mtk_iommu.c
> @@ -3,7 +3,6 @@
>   * Copyright (c) 2015-2016 MediaTek Inc.
>   * Author: Yong Wu 
>   */
> -#include 
>  #include 
>  #include 
>  #include 
> @@ -15,13 +14,16 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  
> @@ -599,8 +601,10 @@ static int mtk_iommu_probe(struct platform_device *pdev)
>   struct resource *res;
>   resource_size_t ioaddr;
>   struct component_match  *match = NULL;
> + struct regmap   *infracfg;
>   void*protect;
>   int i, larb_nr, ret;
> + u32 val;
>  
>   data = devm_kzalloc(dev, sizeof(*data), GFP_KERNEL);
>   if (!data)
> @@ -614,10 +618,22 @@ static int mtk_iommu_probe(struct platform_device *pdev)
>   return -ENOMEM;
>   data->protect_base = ALIGN(virt_to_phys(protect), MTK_PROTECT_PA_ALIGN);
>  
> - /* Whether the current dram is over 4GB */
> - data->enable_4GB = !!(max_pfn > (BIT_ULL(32) >> PAGE_SHIFT));
> - if (!data->plat_data->has_4gb_mode)
> - data->enable_4GB = false;
> + data->enable_4GB = false;
> + if (data->plat_data->has_4gb_mode) {
> + infracfg = syscon_regmap_lookup_by_compatible(
> + "mediatek,mt8173-infracfg");
> + if (IS_ERR(infracfg)) {
> + infracfg = syscon_regmap_lookup_by_compatible(
> + "mediatek,mt2712-infracfg");
> + if (IS_ERR(infracfg))
> + return PTR_ERR(infracfg);
> +
> + }
> + ret = regmap_read(infracfg, REG_INFRA_MISC, );
> + if (ret)
> + return ret;
> + data->enable_4GB = !!(val & F_DDR_4GB_SUPPORT_EN);

(I am absolutely not familiar with syscon_regmap_lookup_by ..., I am
missing some context, so sorry for the stupid questions)

Who sets the regmap value and based on what? Who decides that 4GB mode
is supported or not? And who decides if 4GB mode is *required* or not?

-- 
Thanks,

David / dhildenb



Re: [PATCH] iommu/mediatek: Use totalram_pages to setup enable_4GB

2020-06-05 Thread David Hildenbrand
On 04.06.20 17:27, David Hildenbrand wrote:
> On 04.06.20 17:06, Christoph Hellwig wrote:
>> On Thu, Jun 04, 2020 at 01:32:40PM +0200, David Hildenbrand wrote:
>>> Just a thought: If memory hotplug is applicable as well, you might
>>> either want to always assume data->enable_4GB, or handle memory hotplug
>>> events from the memory notifier, when new memory gets onlined (not sure
>>> how tricky that is).
>>
>> We probably want a highest_pfn_possible() or similar API instead of
>> having drivers poking into random VM internals.
> 
> Well, memory notifiers are a reasonable API used across the kernel to
> get notified when new memory is onlined to the buddy that could be used
> for allocations.
> 
> highest_pfn_possible() would have to default to something linked to
> MAX_PHYSMEM_BITS whenever memory hotplug is configured, I am not sure
> how helpful that is (IOW, you can just default to enable_4GB=true in
> that case instead in most cases).

Correction: At least on x86-64 we have max_possible_pfn, which will
consult the ACPI SRAT table to figure out the maximum possible PFN.
(Without SRAT, max_possible_pfn will point at the end of initial boot
memory and not consider hotplug memory - something that e.g., newer QEMU
versions work around by creating SRAT tables if memory hotplug might be
possible, even if there is no actual NUMA configuration).

pci-swiotlb.c similarly relies on that to figure out if there are any
!DMA addresses to handle.

-- 
Thanks,

David / dhildenb



Re: [PATCH] iommu/mediatek: Use totalram_pages to setup enable_4GB

2020-06-04 Thread David Hildenbrand
On 04.06.20 17:06, Christoph Hellwig wrote:
> On Thu, Jun 04, 2020 at 01:32:40PM +0200, David Hildenbrand wrote:
>> Just a thought: If memory hotplug is applicable as well, you might
>> either want to always assume data->enable_4GB, or handle memory hotplug
>> events from the memory notifier, when new memory gets onlined (not sure
>> how tricky that is).
> 
> We probably want a highest_pfn_possible() or similar API instead of
> having drivers poking into random VM internals.

Well, memory notifiers are a reasonable API used across the kernel to
get notified when new memory is onlined to the buddy that could be used
for allocations.

highest_pfn_possible() would have to default to something linked to
MAX_PHYSMEM_BITS whenever memory hotplug is configured, I am not sure
how helpful that is (IOW, you can just default to enable_4GB=true in
that case instead in most cases).

-- 
Thanks,

David / dhildenb



Re: [PATCH] iommu/mediatek: Use totalram_pages to setup enable_4GB

2020-06-04 Thread David Hildenbrand
On 04.06.20 11:49, Miles Chen wrote:
> On Thu, 2020-06-04 at 10:25 +0200, David Hildenbrand wrote:
>> On 04.06.20 10:01, Miles Chen wrote:
>>> To build this driver as a kernel module, we cannot use
>>> the unexported symbol "max_pfn" to setup enable_4GB.
>>>
>>> Use totalram_pages() instead to setup enable_4GB.
>>>
>>> Suggested-by: Mike Rapoport 
>>> Signed-off-by: Miles Chen 
>>> Cc: David Hildenbrand 
>>> Cc: Yong Wu 
>>> Cc: Chao Hao 
>>> ---
>>>  drivers/iommu/mtk_iommu.c | 5 ++---
>>>  1 file changed, 2 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/drivers/iommu/mtk_iommu.c b/drivers/iommu/mtk_iommu.c
>>> index 5f4d6df59cf6..c2798a6e0e38 100644
>>> --- a/drivers/iommu/mtk_iommu.c
>>> +++ b/drivers/iommu/mtk_iommu.c
>>> @@ -3,7 +3,6 @@
>>>   * Copyright (c) 2015-2016 MediaTek Inc.
>>>   * Author: Yong Wu 
>>>   */
>>> -#include 
>>>  #include 
>>>  #include 
>>>  #include 
>>> @@ -626,8 +625,8 @@ static int mtk_iommu_probe(struct platform_device *pdev)
>>> return -ENOMEM;
>>> data->protect_base = ALIGN(virt_to_phys(protect), MTK_PROTECT_PA_ALIGN);
>>>  
>>> -   /* Whether the current dram is over 4GB */
>>> -   data->enable_4GB = !!(max_pfn > (BIT_ULL(32) >> PAGE_SHIFT));
>>> +   /* Whether the current dram is over 4GB, note: DRAM starts at 1GB  */
>>> +   data->enable_4GB = !!(totalram_pages() > ((SZ_2G + SZ_1G) >> PAGE_SHIFT));
>>
>> A similar thing seems to be done by
>> drivers/media/platform/mtk-vpu/mtk_vpu.c:
>>  vpu->enable_4GB = !!(totalram_pages() > (SZ_2G >> PAGE_SHIFT));
>>
>> I do wonder if some weird memory hotplug setups might give you false
>> negatives.
>>
>> E.g., start a VM with 1GB and hotplug 1GB - it will be hotplugged on
>> x86-64 above 4GB, turning max_pfn into 5GB. totalram_pages() should
>> return something < 2GB.
>>
>> Same can happen when you have a VM and use ballooning to fake-unplug
>> memory, making totalram_pages() return something < 4GB, but leaving
>> usable pfns >= 4GB
> 
> Yes. Yingjoe also told me that this patch is not correct.
> 
> Thanks for pointing this out. totalram_pages() does not work 
> for some cases:
> 

Just a thought: If memory hotplug is applicable as well, you might
either want to always assume data->enable_4GB, or handle memory hotplug
events from the memory notifier, when new memory gets onlined (not sure
how tricky that is).
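
A rough sketch of what such a memory notifier could look like; the
condition and flag update are hypothetical, just to show the mechanism:

/* sketch: react to newly onlined memory via a memory notifier */
#include <linux/memory.h>
#include <linux/mm.h>
#include <linux/sizes.h>

static bool enable_4GB_sketch;

static int mem_online_cb(struct notifier_block *nb, unsigned long action,
			 void *arg)
{
	struct memory_notify *mn = arg;

	/* hypothetical: flip the mode once memory shows up above 4GB */
	if (action == MEM_ONLINE &&
	    mn->start_pfn + mn->nr_pages > (SZ_4G >> PAGE_SHIFT))
		enable_4GB_sketch = true;
	return NOTIFY_OK;
}

static struct notifier_block mem_nb_sketch = {
	.notifier_call = mem_online_cb,
};
/* in probe: register_memory_notifier(&mem_nb_sketch); */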


-- 
Thanks,

David / dhildenb



Re: [PATCH] iommu/mediatek: Use totalram_pages to setup enable_4GB

2020-06-04 Thread David Hildenbrand
On 04.06.20 10:01, Miles Chen wrote:
> To build this driver as a kernel module, we cannot use
> the unexported symbol "max_pfn" to setup enable_4GB.
> 
> Use totalram_pages() instead to setup enable_4GB.
> 
> Suggested-by: Mike Rapoport 
> Signed-off-by: Miles Chen 
> Cc: David Hildenbrand 
> Cc: Yong Wu 
> Cc: Chao Hao 
> ---
>  drivers/iommu/mtk_iommu.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/iommu/mtk_iommu.c b/drivers/iommu/mtk_iommu.c
> index 5f4d6df59cf6..c2798a6e0e38 100644
> --- a/drivers/iommu/mtk_iommu.c
> +++ b/drivers/iommu/mtk_iommu.c
> @@ -3,7 +3,6 @@
>   * Copyright (c) 2015-2016 MediaTek Inc.
>   * Author: Yong Wu 
>   */
> -#include 
>  #include 
>  #include 
>  #include 
> @@ -626,8 +625,8 @@ static int mtk_iommu_probe(struct platform_device *pdev)
>   return -ENOMEM;
>   data->protect_base = ALIGN(virt_to_phys(protect), MTK_PROTECT_PA_ALIGN);
>  
> - /* Whether the current dram is over 4GB */
> - data->enable_4GB = !!(max_pfn > (BIT_ULL(32) >> PAGE_SHIFT));
> + /* Whether the current dram is over 4GB, note: DRAM starts at 1GB  */
> + data->enable_4GB = !!(totalram_pages() > ((SZ_2G + SZ_1G) >> PAGE_SHIFT));

A similar thing seems to be done by
drivers/media/platform/mtk-vpu/mtk_vpu.c:
vpu->enable_4GB = !!(totalram_pages() > (SZ_2G >> PAGE_SHIFT));

I do wonder if some weird memory hotplug setups might give you false
negatives.

E.g., start a VM with 1GB and hotplug 1GB - it will be hotplugged on
x86-64 above 4GB, turning max_pfn into 5GB. totalram_pages() should
return something < 2GB.

Same can happen when you have a VM and use ballooning to fake-unplug
memory, making totalram_pages() return something < 4GB, but leaving
usable pfns >= 4GB.

but
... I don't know if I understood what "enable_4GB" needs/implies
... I don't know if this is applicable to VMs at all (on real HW such
memory hotplug setups should not exist)
... I don't know how this code would react to memory hotplug, so if the
condition changes after the driver loaded and enable_4GB would
suddenly apply. Again, most probably not relevant on real HW, only
for VMs.

-- 
Thanks,

David / dhildenb



Re: [PATCH V2] mm: Replace all open encodings for NUMA_NO_NODE

2018-11-26 Thread David Hildenbrand
On 26.11.18 13:26, Anshuman Khandual wrote:
> At present there are multiple places where an invalid node number is encoded
> as -1. Even though implicitly understood, it is always better to have macros
> for it. Replace these open encodings for an invalid node number with the
> global macro NUMA_NO_NODE. This helps remove NUMA-related assumptions like
> 'invalid node' from various places, redirecting them to a common definition.
> 
> Signed-off-by: Anshuman Khandual 
> ---
> Changes in V2:
> 
> - Added inclusion of 'numa.h' header at various places per Andrew
> - Updated 'dev_to_node' to use NUMA_NO_NODE instead per Vinod

Reviewed-by: David Hildenbrand 

-- 

Thanks,

David / dhildenb


Re: [PATCH] mm: Replace all open encodings for NUMA_NO_NODE

2018-11-23 Thread David Hildenbrand
On 23.11.18 10:54, Anshuman Khandual wrote:
> At present there are multiple places where an invalid node number is encoded
> as -1. Even though implicitly understood, it is always better to have macros
> for it. Replace these open encodings for an invalid node number with the
> global macro NUMA_NO_NODE. This helps remove NUMA-related assumptions like
> 'invalid node' from various places, redirecting them to a common definition.
> 
> Signed-off-by: Anshuman Khandual 
> ---
> 
> Changes in V1:
> 
> - Dropped OCFS2 changes per Joseph
> - Dropped media/video drivers changes per Hans
> 
> RFC - https://patchwork.kernel.org/patch/10678035/
> 
> Build-tested this with multiple cross-compiler options like alpha, sparc,
> arm64, x86, powerpc, powerpc64le, etc., with their default configs, which
> might not have compile-tested all driver-related changes. I would appreciate
> folks giving this a test in their respective build environments.
> 
> All these places for replacement were found by running the following grep
> patterns on the entire kernel code. Please let me know if this might have
> missed some instances. It might also have picked up some false positives.
> I would appreciate suggestions, inputs and review.
> 
> 1. git grep "nid == -1"
> 2. git grep "node == -1"
> 3. git grep "nid = -1"
> 4. git grep "node = -1"

Hopefully you found most users :)

Did you check if some are encoded into function calls? f(-1, ...)
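
For illustration, the mechanical change under review is essentially
(NUMA_NO_NODE is defined as -1 in include/linux/numa.h):

-	int nid = -1;			/* open-coded invalid node */
+	int nid = NUMA_NO_NODE;		/* self-documenting */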

Reviewed-by: David Hildenbrand 


-- 

Thanks,

David / dhildenb