Re: [PATCH v1 0/3] mm/memory_hotplug: use PageOffline() instead of PageReserved() for !ZONE_DEVICE
AFAICT we're in decent shape to move this series into mm-stable. I've tagged the following issues: https://lkml.kernel.org/r/80532f73e52e2c21fdc9aac7bce24aefb76d11b0.ca...@linux.intel.com https://lkml.kernel.org/r/30b5d493-b7c2-4e63-86c1-dcc73d21d...@redhat.com Have these been addressed, and are we ready to send this series into the world? Thanks.
Re: [PATCH v1 2/3] mm/memory_hotplug: initialize memmap of !ZONE_DEVICE with PageOffline() instead of PageReserved()
On Tue, 11 Jun 2024 11:42:56 +0200 David Hildenbrand wrote: > > We'll leave the ZONE_DEVICE case alone for now. > > > > @Andrew, can we add here: > > "Note that self-hosted vmemmap pages will no longer be marked as > reserved. This matches ordinary vmemmap pages allocated from the buddy > during memory hotplug. Now, really only vmemmap pages allocated from > memblock during early boot will be marked reserved. Existing > PageReserved() checks seem to be handling all relevant cases correctly > even after this change." Done, thanks.
Re: [PATCH v1 1/3] mm: pass meminit_context to __free_pages_core()
On Tue, 11 Jun 2024 12:06:56 +0200 David Hildenbrand wrote: > On 07.06.24 11:09, David Hildenbrand wrote: > > In preparation for further changes, let's teach __free_pages_core() > > about the differences of memory hotplug handling. > > > > Move the memory hotplug specific handling from generic_online_page() to > > __free_pages_core(), use adjust_managed_page_count() on the memory > > hotplug path, and spell out why memory freed via memblock > > cannot currently use adjust_managed_page_count(). > > > > Signed-off-by: David Hildenbrand > > --- > > @Andrew, can you squash the following? Sure. I queued it against "mm: pass meminit_context to __free_pages_core()", not against > Subject: [PATCH] fixup: mm/highmem: make nr_free_highpages() return "unsigned long"
Re: [PATCH v2] mm, page_alloc: fix build_zonerefs_node()
On Thu, 7 Apr 2022 14:06:37 +0200 Juergen Gross wrote: > Since commit 6aa303defb74 ("mm, vmscan: only allocate and reclaim from > zones with pages managed by the buddy allocator") Six years ago! > only zones with free > memory are included in a built zonelist. This is problematic when e.g. > all memory of a zone has been ballooned out when zonelists are being > rebuilt. > > The decision whether to rebuild the zonelists when onlining new memory > is done based on populated_zone() returning 0 for the zone the memory > will be added to. The new zone is added to the zonelists only, if it > has free memory pages (managed_zone() returns a non-zero value) after > the memory has been onlined. This implies, that onlining memory will > always free the added pages to the allocator immediately, but this is > not true in all cases: when e.g. running as a Xen guest the onlined > new memory will be added only to the ballooned memory list, it will be > freed only when the guest is being ballooned up afterwards. > > Another problem with using managed_zone() for the decision whether a > zone is being added to the zonelists is, that a zone with all memory > used will in fact be removed from all zonelists in case the zonelists > happen to be rebuilt. > > Use populated_zone() when building a zonelist as it has been done > before that commit. > > Cc: sta...@vger.kernel.org Some details, please. Is this really serious enough to warrant backporting? Is some new workload/usage pattern causing people to hit this?
Re: remove alloc_vm_area v2
On Thu, 24 Sep 2020 15:58:42 +0200 Christoph Hellwig wrote: > this series removes alloc_vm_area, which was left over from the big > vmalloc interface rework. It is a rather arcane interface, basically > the equivalent of get_vm_area + actually faulting in all PTEs in > the allocated area. It was originally added for Xen (which isn't > modular to start with), and then grew users in zsmalloc and i915 > which seems to mostly qualify as abuses of the interface, especially > for i915 as a random driver should not set up PTE bits directly. > > Note that the i915 patches apply to the drm-tip branch of the drm-tip > tree, as that tree has recent conflicting commits in the same area. Is the drm-tip material in linux-next yet? I'm still seeing a non-trivial reject in there at present.
Re: [PATCH v4 1/2] memremap: rename MEMORY_DEVICE_DEVDAX to MEMORY_DEVICE_GENERIC
On Tue, 11 Aug 2020 11:44:46 +0200 Roger Pau Monne wrote: > This is in preparation for the logic behind MEMORY_DEVICE_DEVDAX also > being used by non DAX devices. Acked-by: Andrew Morton . Please add it to the Xen tree when appropriate. (I'm not sure what David means by "separate type", but we can do that later if desired. Dan is taking a bit of downtime).
Re: [PATCH v2 2/3] mm/memory_hotplug: Introduce MHP_NO_FIRMWARE_MEMMAP
On Thu, 30 Apr 2020 20:43:39 +0200 David Hildenbrand wrote: > > > > Why does the firmware map support hotplug entries? > > I assume: > > The firmware memmap was added primarily for x86-64 kexec (and still, is > mostly used on x86-64 only IIRC). There, we had ACPI hotplug. When DIMMs > get hotplugged on real HW, they get added to e820. Same applies to > memory added via HyperV balloon (unless memory is unplugged via > ballooning and you reboot ... then the e820 is changed as well). I assume > we wanted to be able to reflect that, to make kexec look like a real reboot. > > This worked for a while. Then came dax/kmem. Now comes virtio-mem. > > > But I assume only Andrew can enlighten us. > > @Andrew, any guidance here? Should we really add all memory to the > firmware memmap, even if this contradicts with the existing > documentation? (especially, if the actual firmware memmap will *not* > contain that memory after a reboot) For some reason that patch is misattributed - it was authored by Shaohui Zheng , who hasn't been heard from in a decade. I looked through the email discussion from that time and I'm not seeing anything useful. But I wasn't able to locate Dave Hansen's review comments.
Re: [Xen-devel] [PATCH v2 0/8] mm/kdump: allow to exclude pages that are logically offline
On Wed, 27 Feb 2019 13:32:14 +0800 Dave Young wrote: > This series have been in -next for some days, could we get this in > mainline? It's been in -next for two months? > Andrew, do you have plan about them, maybe next release? They're all reviewed except for "xen/balloon: mark inflated pages PG_offline". (https://ozlabs.org/~akpm/mmotm/broken-out/xen-balloon-mark-inflated-pages-pg_offline.patch). Yes, I plan on sending these to Linus during the merge window for 5.1 ___ Xen-devel mailing list Xen-devel@lists.xenproject.org https://lists.xenproject.org/mailman/listinfo/xen-devel
Re: [Xen-devel] [PATCH v5 1/2] memory_hotplug: Free pages as higher order
On Mon, 05 Nov 2018 15:12:27 +0530 Arun KS wrote: > On 2018-10-22 16:03, Arun KS wrote: > > On 2018-10-19 13:37, Michal Hocko wrote: > >> On Thu 18-10-18 19:18:25, Andrew Morton wrote: > >> [...] > >>> So this patch needs more work, yes? > >> > >> Yes, I've talked to Arun (he is offline until next week) offlist and > >> he > >> will play with this some more. > > > > Converted totalhigh_pages, totalram_pages and zone->managed_page to > > atomic and tested hot add. Latency is not effected with this change. > > Will send out a separate patch on top of this one. > Hello Andrew/Michal, > > Will this be going in subsequent -rcs? I thought we were awaiting a new version? "Will send out a separate patch on top of this one"? I do think a resend would be useful, please. Ensure the changelog is updated to capture the above info and any other worthy issues which arose during review.
Re: [Xen-devel] [PATCH v5 1/2] memory_hotplug: Free pages as higher order
On Thu, 11 Oct 2018 09:55:03 +0200 Michal Hocko wrote: > > > > > This is now not called anymore, although the xen/hv variants still do > > > > > it. The function seems empty these days, maybe remove it as a followup > > > > > cleanup? > > > > > > > > > > > - __online_page_increment_counters(page); > > > > > > - __online_page_free(page); > > > > > > + __free_pages_core(page, order); > > > > > > + totalram_pages += (1UL << order); > > > > > > +#ifdef CONFIG_HIGHMEM > > > > > > + if (PageHighMem(page)) > > > > > > + totalhigh_pages += (1UL << order); > > > > > > +#endif > > > > > > > > > > __online_page_increment_counters() would have used > > > > > adjust_managed_page_count() which would do the changes under > > > > > managed_page_count_lock. Are we safe without the lock? If yes, there > > > > > should perhaps be a comment explaining why. > > > > > > > > Looks unsafe without managed_page_count_lock. > > > > > > Why does it matter actually? We cannot online/offline memory in > > > parallel. This is not the case for the boot where we initialize memory > > > in parallel on multiple nodes. So this seems to be safe currently unless > > > I am missing something. A comment explaining that would be helpful > > > though. > > > > Other main callers of adjust_manage_page_count(), > > > > static inline void free_reserved_page(struct page *page) > > { > > __free_reserved_page(page); > > adjust_managed_page_count(page, 1); > > } > > > > static inline void mark_page_reserved(struct page *page) > > { > > SetPageReserved(page); > > adjust_managed_page_count(page, -1); > > } > > > > Won't they race with memory hotplug? > > > > Few more, > > ./drivers/xen/balloon.c:519:adjust_managed_page_count(page, -1); > > ./drivers/virtio/virtio_balloon.c:175: adjust_managed_page_count(page, -1); > > ./drivers/virtio/virtio_balloon.c:196: adjust_managed_page_count(page, 1); > > ./mm/hugetlb.c:2158:adjust_managed_page_count(page, 1 << > > h->order); > > They can, and I have missed those. 
So this patch needs more work, yes?
Re: [Xen-devel] [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Tue, 24 Jul 2018 16:17:47 +0200 Michal Hocko wrote: > On Fri 20-07-18 17:09:02, Andrew Morton wrote: > [...] > > - Undocumented return value. > > > > - comment "failed to reap part..." is misleading - sounds like it's > > referring to something which happened in the past, is in fact > > referring to something which might happen in the future. > > > > - fails to call trace_finish_task_reaping() in one case > > > > - code duplication. > > > > - Increases mmap_sem hold time a little by moving > > trace_finish_task_reaping() inside the locked region. So sue me ;) > > > > - Sharing the finish: path means that the trace event won't > > distinguish between the two sources of finishing. > > > > Please take a look? > > oom_reap_task_mm should return false when __oom_reap_task_mm return > false. This is what my patch did but it seems this changed by > http://www.ozlabs.org/~akpm/mmotm/broken-out/mm-oom-remove-oom_lock-from-oom_reaper.patch > so that one should be fixed. > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > index 104ef4a01a55..88657e018714 100644 > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -565,7 +565,7 @@ static bool oom_reap_task_mm(struct task_struct *tsk, > struct mm_struct *mm) > /* failed to reap part of the address space. Try again later */ > if (!__oom_reap_task_mm(mm)) { > up_read(&mm->mmap_sem); > - return true; > + return false; > } > > pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, > file-rss:%lukB, shmem-rss:%lukB\n", OK, thanks, I added that. > > On top of that the proposed cleanup looks as follows: > Looks good to me. Seems a bit strange that we omit the pr_info() output if the mm was partially reaped - people would still want to know this? Not very important though.
Re: [Xen-devel] [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Mon, 16 Jul 2018 13:50:58 +0200 Michal Hocko wrote: > From: Michal Hocko > > There are several blockable mmu notifiers which might sleep in > mmu_notifier_invalidate_range_start and that is a problem for the > oom_reaper because it needs to guarantee a forward progress so it cannot > depend on any sleepable locks. > > ... > > @@ -571,7 +565,12 @@ static bool oom_reap_task_mm(struct task_struct *tsk, > struct mm_struct *mm) > > trace_start_task_reaping(tsk->pid); > > - __oom_reap_task_mm(mm); > + /* failed to reap part of the address space. Try again later */ > + if (!__oom_reap_task_mm(mm)) { > + up_read(&mm->mmap_sem); > + ret = false; > + goto unlock_oom; > + } This function is starting to look a bit screwy. : static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm) : { : if (!down_read_trylock(&mm->mmap_sem)) { : trace_skip_task_reaping(tsk->pid); : return false; : } : : /* :* MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't :* work on the mm anymore. The check for MMF_OOM_SKIP must run :* under mmap_sem for reading because it serializes against the :* down_write();up_write() cycle in exit_mmap(). :*/ : if (test_bit(MMF_OOM_SKIP, &mm->flags)) { : up_read(&mm->mmap_sem); : trace_skip_task_reaping(tsk->pid); : return true; : } : : trace_start_task_reaping(tsk->pid); : : /* failed to reap part of the address space. Try again later */ : if (!__oom_reap_task_mm(mm)) { : up_read(&mm->mmap_sem); : return true; : } : : pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n", : task_pid_nr(tsk), tsk->comm, : K(get_mm_counter(mm, MM_ANONPAGES)), : K(get_mm_counter(mm, MM_FILEPAGES)), : K(get_mm_counter(mm, MM_SHMEMPAGES))); : up_read(&mm->mmap_sem); : : trace_finish_task_reaping(tsk->pid); : return true; : } - Undocumented return value. - comment "failed to reap part..." 
is misleading - sounds like it's referring to something which happened in the past, is in fact referring to something which might happen in the future. - fails to call trace_finish_task_reaping() in one case - code duplication. I'm thinking it wants to be something like this? : /* : * Return true if we successfully acquired (then released) mmap_sem : */ : static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm) : { : if (!down_read_trylock(&mm->mmap_sem)) { : trace_skip_task_reaping(tsk->pid); : return false; : } : : /* :* MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't :* work on the mm anymore. The check for MMF_OOM_SKIP must run :* under mmap_sem for reading because it serializes against the :* down_write();up_write() cycle in exit_mmap(). :*/ : if (test_bit(MMF_OOM_SKIP, &mm->flags)) { : trace_skip_task_reaping(tsk->pid); : goto out; : } : : trace_start_task_reaping(tsk->pid); : : if (!__oom_reap_task_mm(mm)) { : /* Failed to reap part of the address space. Try again later */ : goto finish; : } : : pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n", : task_pid_nr(tsk), tsk->comm, : K(get_mm_counter(mm, MM_ANONPAGES)), : K(get_mm_counter(mm, MM_FILEPAGES)), : K(get_mm_counter(mm, MM_SHMEMPAGES))); : finish: : trace_finish_task_reaping(tsk->pid); : out: : up_read(&mm->mmap_sem); : return true; : } - Increases mmap_sem hold time a little by moving trace_finish_task_reaping() inside the locked region. So sue me ;) - Sharing the finish: path means that the trace event won't distinguish between the two sources of finishing. Please take a look?
Re: [Xen-devel] [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Tue, 17 Jul 2018 10:12:01 +0200 Michal Hocko wrote: > > Any suggestions regarding how the driver developers can test this code > > path? I don't think we presently have a way to fake an oom-killing > > event? Perhaps we should add such a thing, given the problems we're > > having with that feature. > > The simplest way is to wrap an userspace code which uses these notifiers > into a memcg and set the hard limit to hit the oom. This can be done > e.g. after the test faults in all the mmu notifier managed memory and > set the hard limit to something really small. Then we are looking for a > proper process tear down. Chances are, some of the intended audience don't know how to do this and will either have to hunt down a lot of documentation or will just not test it. But we want them to test it, so a little worked step-by-step example would help things along please.
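Something like the following sketch turns Michal's recipe into concrete steps. It assumes cgroup v2 mounted at /sys/fs/cgroup, root privileges, and a placeholder program "./notifier-test" standing in for whatever userspace code exercises the driver's mmu notifiers; the cgroup path, limit, and timings are illustrative guesses. It is guarded by an environment variable so it only performs the privileged steps when explicitly asked to.

```shell
# Drive a memcg OOM kill against a program that holds mmu-notifier-managed
# memory, then look for oom_reaper activity in the kernel log.
if [ -z "$RUN_OOM_TEST" ]; then
	echo "set RUN_OOM_TEST=1 (as root, with cgroup v2) to run; skipping"
	exit 0
fi

CG=/sys/fs/cgroup/oom-reap-test
mkdir -p "$CG"

./notifier-test &			# must fault in all notifier-managed memory
PID=$!
echo "$PID" > "$CG/cgroup.procs"	# move the victim into the test memcg

sleep 5					# give it time to finish faulting everything in
echo $((8 * 1024 * 1024)) > "$CG/memory.max"	# tiny hard limit -> memcg OOM

wait "$PID"				# expect the OOM killer to SIGKILL it
dmesg | tail -n 30			# look for "oom_reaper: reaped process ..."
rmdir "$CG"
```

The thing to verify is a proper teardown: the victim exits, the reaper logs that it reaped the process, and the driver's notifier callbacks neither block nor leak.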
Re: [Xen-devel] [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Mon, 16 Jul 2018 13:50:58 +0200 Michal Hocko wrote: > From: Michal Hocko > > There are several blockable mmu notifiers which might sleep in > mmu_notifier_invalidate_range_start and that is a problem for the > oom_reaper because it needs to guarantee a forward progress so it cannot > depend on any sleepable locks. > > Currently we simply back off and mark an oom victim with blockable mmu > notifiers as done after a short sleep. That can result in selecting a > new oom victim prematurely because the previous one still hasn't torn > its memory down yet. > > We can do much better though. Even if mmu notifiers use sleepable locks > there is no reason to automatically assume those locks are held. > Moreover majority of notifiers only care about a portion of the address > space and there is absolutely zero reason to fail when we are unmapping an > unrelated range. Many notifiers do really block and wait for HW which is > harder to handle and we have to bail out though. > > This patch handles the low hanging fruid. > __mmu_notifier_invalidate_range_start > gets a blockable flag and callbacks are not allowed to sleep if the > flag is set to false. This is achieved by using trylock instead of the > sleepable lock for most callbacks and continue as long as we do not > block down the call chain. I assume device driver developers are wondering "what does this mean for me". As I understand it, the only time they will see blockable==false is when their driver is being called in response to an out-of-memory condition, yes? So it is a very rare thing. Any suggestions regarding how the driver developers can test this code path? I don't think we presently have a way to fake an oom-killing event? Perhaps we should add such a thing, given the problems we're having with that feature. > I think we can improve that even further because there is a common > pattern to do a range lookup first and then do something about that. 
> The first part can be done without a sleeping lock in most cases AFAICS. > > The oom_reaper end then simply retries if there is at least one notifier > which couldn't make any progress in !blockable mode. A retry loop is > already implemented to wait for the mmap_sem and this is basically the > same thing. > > ... > > +static inline int mmu_notifier_invalidate_range_start_nonblock(struct > mm_struct *mm, > + unsigned long start, unsigned long end) > +{ > + int ret = 0; > + if (mm_has_notifiers(mm)) > + ret = __mmu_notifier_invalidate_range_start(mm, start, end, > false); > + > + return ret; > } nit, { if (mm_has_notifiers(mm)) return __mmu_notifier_invalidate_range_start(mm, start, end, false); return 0; } would suffice. > > ... > > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -3074,7 +3074,7 @@ void exit_mmap(struct mm_struct *mm) >* reliably test it. >*/ > mutex_lock(&oom_lock); > - __oom_reap_task_mm(mm); > + (void)__oom_reap_task_mm(mm); > mutex_unlock(&oom_lock); What does this do? > set_bit(MMF_OOM_SKIP, &mm->flags); > > ... >
Re: [Xen-devel] [PATCH] mm: don't defer struct page initialization for Xen pv guests
On Mon, 19 Feb 2018 02:45:27 +0800 kbuild test robot wrote: > [auto build test ERROR on mmotm/master] > [also build test ERROR on v4.16-rc1 next-20180216] > [if your patch is applied to the wrong git tree, please drop us a note to > help improve the system] > > url: > https://github.com/0day-ci/linux/commits/Juergen-Gross/mm-don-t-defer-struct-page-initialization-for-Xen-pv-guests/20180218-233657 > base: git://git.cmpxchg.org/linux-mmotm.git master > config: i386-randconfig-x010-201807 (attached as .config) > compiler: gcc-7 (Debian 7.3.0-1) 7.3.0 > reproduce: > # save the attached .config to linux build tree > make ARCH=i386 > > All errors (new ones prefixed by >>): > >mm/page_alloc.c: In function 'update_defer_init': > >> mm/page_alloc.c:352:6: error: implicit declaration of function > >> 'xen_pv_domain' [-Werror=implicit-function-declaration] > if (xen_pv_domain()) > ^ I think I already fixed this. From: Andrew Morton Subject: mm-dont-defer-struct-page-initialization-for-xen-pv-guests-fix explicitly include xen.h Cc: Juergen Gross Signed-off-by: Andrew Morton --- mm/page_alloc.c |1 + 1 file changed, 1 insertion(+) diff -puN mm/page_alloc.c~mm-dont-defer-struct-page-initialization-for-xen-pv-guests-fix mm/page_alloc.c --- a/mm/page_alloc.c~mm-dont-defer-struct-page-initialization-for-xen-pv-guests-fix +++ a/mm/page_alloc.c @@ -46,6 +46,7 @@ #include #include #include +#include #include #include #include _
Re: [Xen-devel] [RESEND v2] mm: don't defer struct page initialization for Xen pv guests
On Fri, 16 Feb 2018 16:41:01 +0100 Juergen Gross wrote: > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -347,6 +347,9 @@ static inline bool update_defer_init(pg_data_t *pgdat, > /* Always populate low zones for address-constrained allocations */ > if (zone_end < pgdat_end_pfn(pgdat)) > return true; > + /* Xen PV domains need page structures early */ > + if (xen_pv_domain()) > + return true; I'll do this: --- a/mm/page_alloc.c~mm-dont-defer-struct-page-initialization-for-xen-pv-guests-fix +++ a/mm/page_alloc.c @@ -46,6 +46,7 @@ #include #include #include +#include #include #include #include So we're not relying on dumb luck ;)
Re: [Xen-devel] [RESEND v2] mm: don't defer struct page initialization for Xen pv guests
On Fri, 16 Feb 2018 16:41:01 +0100 Juergen Gross wrote: > Commit f7f99100d8d95dbcf09e0216a143211e79418b9f ("mm: stop zeroing > memory during allocation in vmemmap") broke Xen pv domains in some > configurations, as the "Pinned" information in struct page of early > page tables could get lost. This will lead to the kernel trying to > write directly into the page tables instead of asking the hypervisor > to do so. The result is a crash like the following: Let's cc Pavel, who authored f7f99100d8d95d. > [0.004000] BUG: unable to handle kernel paging request at 8801ead19008 > [0.004000] IP: xen_set_pud+0x4e/0xd0 > [0.004000] PGD 1c0a067 P4D 1c0a067 PUD 23a0067 PMD 1e9de0067 PTE > 8011ead19065 > [0.004000] Oops: 0003 [#1] PREEMPT SMP > [0.004000] Modules linked in: > [0.004000] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.14.0-default+ #271 > [0.004000] Hardware name: Dell Inc. Latitude E6440/0159N7, BIOS A07 > 06/26/2014 > [0.004000] task: 81c10480 task.stack: 81c0 > [0.004000] RIP: e030:xen_set_pud+0x4e/0xd0 > [0.004000] RSP: e02b:81c03cd8 EFLAGS: 00010246 > [0.004000] RAX: 00280800 RBX: 88020fd31000 RCX: > > [0.004000] RDX: ea00 RSI: 0001b8308067 RDI: > 8801ead19008 > [0.004000] RBP: 8801ead19008 R08: R09: > 063f4c80 > [0.004000] R10: R11: 0720072007200720 R12: > 0001b8308067 > [0.004000] R13: 81c8a9cc R14: 88018fd31000 R15: > 77ff8000 > [0.004000] FS: () GS:88020f60() > knlGS: > [0.004000] CS: e033 DS: ES: CR0: 80050033 > [0.004000] CR2: 8801ead19008 CR3: 01c09000 CR4: > 00042660 > [0.004000] Call Trace: > [0.004000] __pmd_alloc+0x128/0x140 > [0.004000] ? acpi_os_map_iomem+0x175/0x1b0 > [0.004000] ioremap_page_range+0x3f4/0x410 > [0.004000] ? acpi_os_map_iomem+0x175/0x1b0 > [0.004000] __ioremap_caller+0x1c3/0x2e0 > [0.004000] acpi_os_map_iomem+0x175/0x1b0 > [0.004000] acpi_tb_acquire_table+0x39/0x66 > [0.004000] acpi_tb_validate_table+0x44/0x7c > [0.004000] acpi_tb_verify_temp_table+0x45/0x304 > [0.004000] ? 
acpi_ut_acquire_mutex+0x12a/0x1c2 > [0.004000] acpi_reallocate_root_table+0x12d/0x141 > [0.004000] acpi_early_init+0x4d/0x10a > [0.004000] start_kernel+0x3eb/0x4a1 > [0.004000] ? set_init_arg+0x55/0x55 > [0.004000] xen_start_kernel+0x528/0x532 > [0.004000] Code: 48 01 e8 48 0f 42 15 a2 fd be 00 48 01 d0 48 ba 00 00 00 > 00 00 ea ff ff 48 c1 e8 0c 48 c1 e0 06 48 01 d0 48 8b 00 f6 c4 02 75 5d <4c> > 89 65 00 5b 5d 41 5c c3 65 8b 05 52 9f fe 7e 89 c0 48 0f a3 > [0.004000] RIP: xen_set_pud+0x4e/0xd0 RSP: 81c03cd8 > [0.004000] CR2: 8801ead19008 > [0.004000] ---[ end trace 38eca2e56f1b642e ]--- > > Avoid this problem by not deferring struct page initialization when > running as Xen pv guest. > > ... > > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -347,6 +347,9 @@ static inline bool update_defer_init(pg_data_t *pgdat, > /* Always populate low zones for address-constrained allocations */ > if (zone_end < pgdat_end_pfn(pgdat)) > return true; > + /* Xen PV domains need page structures early */ > + if (xen_pv_domain()) > + return true; > (*nr_initialised)++; > if ((*nr_initialised > pgdat->static_init_pgcnt) && > (pfn & (PAGES_PER_SECTION - 1)) == 0) { I'm OK with applying the patch as a short-term regression fix but I do wonder whether it's the correct fix. What is special about Xen (in some configurations!) that causes it to find a hole in deferred initialization? I'd like us to delve further please. Because if Xen found a hole in the implementation, others might do so. Or perhaps Xen is doing something naughty.