On 06.10.19 10:56, David Hildenbrand wrote:
> We currently try to shrink a single zone when removing memory. We use the
> zone of the first page of the memory we are removing. If that memmap was
> never initialized (e.g., memory was never onlined), we will read garbage
> and can trigger kernel BUGs (due to a stale pointer):
>
> :/# [ 23.912993] BUG: unable to handle page fault for address:
> 000000000000353d
> [ 23.914219] #PF: supervisor write access in kernel mode
> [ 23.915199] #PF: error_code(0x0002) - not-present page
> [ 23.916160] PGD 0 P4D 0
> [ 23.916627] Oops: 0002 [#1] SMP PTI
> [ 23.917256] CPU: 1 PID: 7 Comm: kworker/u8:0 Not tainted
> 5.3.0-rc5-next-20190820+ #317
> [ 23.918900] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.4
> [ 23.921194] Workqueue: kacpi_hotplug acpi_hotplug_work_fn
> [ 23.922249] RIP: 0010:clear_zone_contiguous+0x5/0x10
> [ 23.923173] Code: 48 89 c6 48 89 c3 e8 2a fe ff ff 48 85 c0 75 cf 5b 5d c3
> c6 85 fd 05 00 00 01 5b 5d c3 0f 1f 840
> [ 23.926876] RSP: 0018:ffffad2400043c98 EFLAGS: 00010246
> [ 23.927928] RAX: 0000000000000000 RBX: 0000000200000000 RCX:
> 0000000000000000
> [ 23.929458] RDX: 0000000000200000 RSI: 0000000000140000 RDI:
> 0000000000002f40
> [ 23.930899] RBP: 0000000140000000 R08: 0000000000000000 R09:
> 0000000000000001
> [ 23.932362] R10: 0000000000000000 R11: 0000000000000000 R12:
> 0000000000140000
> [ 23.933603] R13: 0000000000140000 R14: 0000000000002f40 R15:
> ffff9e3e7aff3680
> [ 23.934913] FS: 0000000000000000(0000) GS:ffff9e3e7bb00000(0000)
> knlGS:0000000000000000
> [ 23.936294] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 23.937481] CR2: 000000000000353d CR3: 0000000058610000 CR4:
> 00000000000006e0
> [ 23.938687] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [ 23.939889] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> 0000000000000400
> [ 23.941168] Call Trace:
> [ 23.941580] __remove_pages+0x4b/0x640
> [ 23.942303] ? mark_held_locks+0x49/0x70
> [ 23.943149] arch_remove_memory+0x63/0x8d
> [ 23.943921] try_remove_memory+0xdb/0x130
> [ 23.944766] ? walk_memory_blocks+0x7f/0x9e
> [ 23.945616] __remove_memory+0xa/0x11
> [ 23.946274] acpi_memory_device_remove+0x70/0x100
> [ 23.947308] acpi_bus_trim+0x55/0x90
> [ 23.947914] acpi_device_hotplug+0x227/0x3a0
> [ 23.948714] acpi_hotplug_work_fn+0x1a/0x30
> [ 23.949433] process_one_work+0x221/0x550
> [ 23.950190] worker_thread+0x50/0x3b0
> [ 23.950993] kthread+0x105/0x140
> [ 23.951644] ? process_one_work+0x550/0x550
> [ 23.952508] ? kthread_park+0x80/0x80
> [ 23.953367] ret_from_fork+0x3a/0x50
> [ 23.954025] Modules linked in:
> [ 23.954613] CR2: 000000000000353d
> [ 23.955248] ---[ end trace 93d982b1fb3e1a69 ]---
>
> Instead, shrink the zones when offlining memory or when onlining failed.
> Introduce and use remove_pfn_range_from_zone() for that. We now properly
> shrink the zones, even if we have DIMMs whereby
> - Some memory blocks fall into no zone (never onlined)
> - Some memory blocks fall into multiple zones (offlined+re-onlined)
> - Multiple memory blocks that fall into different zones
>
> Drop the zone parameter (with a potentially dubious value) from
> __remove_pages() and __remove_section().
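
To illustrate the idea described above, here is a toy userspace model. It
is only a sketch and not the code from this patch: every toy_* name, the
per-PFN bool array, and TOY_MAX_PFN are made up for this example. It
assumes the zone span is recomputed from the PFNs still known to belong to
the zone when a range goes offline, so nothing has to be read from a
memmap that may never have been initialized.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_PFN 64UL

/* Toy stand-in for struct zone: only the span is modelled. */
struct toy_zone {
	unsigned long start_pfn;     /* first PFN spanned by the zone */
	unsigned long spanned_pages; /* 0 means the zone is empty */
};

/* Toy stand-in for "this PFN is online and belongs to the zone". */
static bool toy_pfn_in_zone[TOY_MAX_PFN];

/* Recompute the span from the PFNs that currently belong to the zone. */
static void toy_update_zone_span(struct toy_zone *zone)
{
	unsigned long pfn, first = TOY_MAX_PFN, last = 0;

	for (pfn = 0; pfn < TOY_MAX_PFN; pfn++) {
		if (!toy_pfn_in_zone[pfn])
			continue;
		if (first == TOY_MAX_PFN)
			first = pfn;
		last = pfn;
	}

	if (first == TOY_MAX_PFN) {
		/* Nothing left: the zone is empty. */
		zone->start_pfn = 0;
		zone->spanned_pages = 0;
	} else {
		zone->start_pfn = first;
		zone->spanned_pages = last - first + 1;
	}
}

/* Onlining: the range joins the zone and the span may grow. */
static void toy_online_pfn_range(struct toy_zone *zone,
				 unsigned long start_pfn,
				 unsigned long nr_pages)
{
	unsigned long pfn;

	assert(start_pfn + nr_pages <= TOY_MAX_PFN);
	for (pfn = start_pfn; pfn < start_pfn + nr_pages; pfn++)
		toy_pfn_in_zone[pfn] = true;
	toy_update_zone_span(zone);
}

/*
 * Offlining (or failed onlining): the range leaves the zone and the span
 * shrinks right away, while the caller still knows which zone is affected.
 */
static void toy_remove_pfn_range_from_zone(struct toy_zone *zone,
					   unsigned long start_pfn,
					   unsigned long nr_pages)
{
	unsigned long pfn;

	assert(start_pfn + nr_pages <= TOY_MAX_PFN);
	for (pfn = start_pfn; pfn < start_pfn + nr_pages; pfn++)
		toy_pfn_in_zone[pfn] = false;
	toy_update_zone_span(zone);
}
```

In this model, removing memory later needs no zone argument at all: a
range that was never onlined never touched any zone, and a range that was
onlined has already been taken out of its zone at offline time.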
>
> Cc: Catalin Marinas <catalin.mari...@arm.com>
> Cc: Will Deacon <w...@kernel.org>
> Cc: Tony Luck <tony.l...@intel.com>
> Cc: Fenghua Yu <fenghua...@intel.com>
> Cc: Benjamin Herrenschmidt <b...@kernel.crashing.org>
> Cc: Paul Mackerras <pau...@samba.org>
> Cc: Michael Ellerman <m...@ellerman.id.au>
> Cc: Heiko Carstens <heiko.carst...@de.ibm.com>
> Cc: Vasily Gorbik <g...@linux.ibm.com>
> Cc: Christian Borntraeger <borntrae...@de.ibm.com>
> Cc: Yoshinori Sato <ys...@users.sourceforge.jp>
> Cc: Rich Felker <dal...@libc.org>
> Cc: Dave Hansen <dave.han...@linux.intel.com>
> Cc: Andy Lutomirski <l...@kernel.org>
> Cc: Peter Zijlstra <pet...@infradead.org>
> Cc: Thomas Gleixner <t...@linutronix.de>
> Cc: Ingo Molnar <mi...@redhat.com>
> Cc: Borislav Petkov <b...@alien8.de>
> Cc: "H. Peter Anvin" <h...@zytor.com>
> Cc: x...@kernel.org
> Cc: Andrew Morton <a...@linux-foundation.org>
> Cc: Mark Rutland <mark.rutl...@arm.com>
> Cc: Steve Capper <steve.cap...@arm.com>
> Cc: Mike Rapoport <r...@linux.ibm.com>
> Cc: Anshuman Khandual <anshuman.khand...@arm.com>
> Cc: Yu Zhao <yuz...@google.com>
> Cc: Jun Yao <yaojun8558...@gmail.com>
> Cc: Robin Murphy <robin.mur...@arm.com>
> Cc: Michal Hocko <mho...@suse.com>
> Cc: Oscar Salvador <osalva...@suse.de>
> Cc: "Matthew Wilcox (Oracle)" <wi...@infradead.org>
> Cc: Christophe Leroy <christophe.le...@c-s.fr>
> Cc: "Aneesh Kumar K.V" <aneesh.ku...@linux.ibm.com>
> Cc: Pavel Tatashin <pasha.tatas...@soleen.com>
> Cc: Gerald Schaefer <gerald.schae...@de.ibm.com>
> Cc: Halil Pasic <pa...@linux.ibm.com>
> Cc: Tom Lendacky <thomas.lenda...@amd.com>
> Cc: Greg Kroah-Hartman <gre...@linuxfoundation.org>
> Cc: Masahiro Yamada <yamada.masah...@socionext.com>
> Cc: Dan Williams <dan.j.willi...@intel.com>
> Cc: Wei Yang <richard.weiy...@gmail.com>
> Cc: Qian Cai <c...@lca.pw>
> Cc: Jason Gunthorpe <j...@ziepe.ca>
> Cc: Logan Gunthorpe <log...@deltatee.com>
> Cc: Ira Weiny <ira.we...@intel.com>
> Cc: linux-arm-ker...@lists.infradead.org
> Cc: linux-i...@vger.kernel.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-s...@vger.kernel.org
> Cc: linux...@vger.kernel.org
> Fixes: d0dc12e86b31 ("mm/memory_hotplug: optimize memory hotplug")

@Andrew, can you convert that to

Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to
zones until online") # visible after d0dc12e86b319

While cc'ing sta...@vger.kernel.org # v4.13+ would be nice, I doubt it
will be easy to backport, as we are missing some prereq patches (e.g.,
from Oscar, like 2c2a5af6fed2 ("mm, memory_hotplug: add nid parameter to
arch_remove_memory")). But it could be done with some work.

I think "Cc: sta...@vger.kernel.org # v5.0+" could be done more easily.
Maybe it's okay not to cc:stable this one. We usually online all memory
(except on s390x); however, s390x never removes that memory. For devmem
with driver-reserved memory, however, backporting this would be
worthwhile.

Thoughts?

--
Thanks,
David / dhildenb