Re: [PATCH] crash_core: export vmemmap when CONFIG_SPARSEMEM_VMEMMAP is enabled

2023-11-27 Thread Baoquan He
On 11/28/23 at 11:31am, Shijie Huang wrote:
> 
> > On 2023/11/28 11:25, Baoquan He wrote:
> > On 11/27/23 at 11:18am, Shijie Huang wrote:
> > > On 2023/11/27 10:51, Baoquan He wrote:
> > > > Hi,
> > > > 
> > > > On 11/27/23 at 10:07am, Huang Shijie wrote:
> > > > > In memory_model.h, if CONFIG_SPARSEMEM_VMEMMAP is configured,
> > > > > the kernel will use vmemmap to do __pfn_to_page/page_to_pfn,
> > > > > and will not use the "classic sparse" method for
> > > > > __pfn_to_page/page_to_pfn.
> > > > > 
> > > > > So export vmemmap when CONFIG_SPARSEMEM_VMEMMAP is configured.
> > > > > This lets user applications (crash, etc.) get faster
> > > > > pfn_to_page/page_to_pfn operations too.
> > > > Are there Crash or makedumpfile patches posted yet to make use of this?
> > > I have patches for Crash to use the 'vmemmap'; I will send them out
> > > after this patch is merged.
> > > 
> > > (I think Kazu will not merge a crash patch that depends on an
> > > unmerged kernel patch.)
> > Maybe post those userspace patches too so that Kazu can evaluate
> > whether the improvement is worthwhile?
> 
> No problem. I will send them out later.

Thanks, Shijie. Let's wait and see if Kazu has any comments about these.
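For reference, with CONFIG_SPARSEMEM_VMEMMAP the conversions collapse
to plain pointer arithmetic on the virtually contiguous vmemmap array;
a sketch paraphrasing include/asm-generic/memory_model.h:

  /* 'vmemmap' is a virtually contiguous array of struct page indexed
   * directly by pfn, so no mem_section lookup is needed as in the
   * classic sparse model. */
  #define __pfn_to_page(pfn)      (vmemmap + (pfn))
  #define __page_to_pfn(page)     (unsigned long)((page) - vmemmap)

Exporting the vmemmap base in vmcoreinfo lets crash/makedumpfile do the
same O(1) arithmetic instead of walking the mem_section tables.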




Re: [PATCH v2] LoongArch: Load vmlinux.efi to the link address

2023-11-27 Thread hev
Hello Simon,

Could you apply this patch (v2) instead of patch (v1) [1]?
Thanks!

[1] https://lore.kernel.org/kexec/20231124154658.114579-1-wang...@loongson.cn/

On Sat, Nov 25, 2023 at 2:52 PM WANG Rui  wrote:
>
> Currently, kexec loads vmlinux.efi to address 0 instead of the link
> address. This causes kexec to fail to boot the new vmlinux.efi on qemu.
>
>   pei_loongarch_load: kernel_segment: 
>   pei_loongarch_load: kernel_entry:   013f1000
>   pei_loongarch_load: image_size: 01ca
>   pei_loongarch_load: text_offset:0020
>   pei_loongarch_load: phys_offset:
>   pei_loongarch_load: PE format:  yes
>   loongarch_load_other_segments:333: command_line: kexec console=ttyS0,115200
>   kexec_load: entry = 0x13f1000 flags = 0x102
>   nr_segments = 2
>   segment[0].buf   = 0x7fffeea38010
>   segment[0].bufsz = 0x1b55200
>   segment[0].mem   = (nil)
>   segment[0].memsz = 0x1ca
>   segment[1].buf   = 0x570940b0
>   segment[1].bufsz = 0x200
>   segment[1].mem   = 0x1ca
>   segment[1].memsz = 0x4000
>
> This patch constrains the range of the kernel segment by `hole_min`
> and `hole_max` to place vmlinux.efi exactly at the link address.
>
>   pei_loongarch_load: kernel_segment: 0020
>   pei_loongarch_load: kernel_entry:   013f1000
>   pei_loongarch_load: image_size: 01ca
>   pei_loongarch_load: text_offset:0020
>   pei_loongarch_load: phys_offset:
>   pei_loongarch_load: PE format:  yes
>   loongarch_load_other_segments:339: command_line: kexec console=ttyS0,115200
>   kexec_load: entry = 0x13f1000 flags = 0x102
>   nr_segments = 2
>   segment[0].buf   = 0x72028010
>   segment[0].bufsz = 0x1b55200
>   segment[0].mem   = 0x20
>   segment[0].memsz = 0x1ca
>   segment[1].buf   = 0x57498098
>   segment[1].bufsz = 0x200
>   segment[1].mem   = 0x1ea
>   segment[1].memsz = 0x4000
>
> Signed-off-by: WANG Rui 
> ---
>
> v1->v2:
>  * Fix the issue preventing it from working on the physical machine.
>
>  kexec/arch/loongarch/kexec-loongarch.c | 10 +++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
>
> diff --git a/kexec/arch/loongarch/kexec-loongarch.c b/kexec/arch/loongarch/kexec-loongarch.c
> index 62ff8fd..32a42d2 100644
> --- a/kexec/arch/loongarch/kexec-loongarch.c
> +++ b/kexec/arch/loongarch/kexec-loongarch.c
> @@ -265,9 +265,13 @@ unsigned long loongarch_locate_kernel_segment(struct kexec_info *info)
> hole = ULONG_MAX;
> }
> } else {
> -   hole = locate_hole(info,
> -   loongarch_mem.text_offset + loongarch_mem.image_size,
> -   MiB(1), 0, ULONG_MAX, 1);
> +   unsigned long hole_min;
> +   unsigned long hole_max;
> +
> +   hole_min = loongarch_mem.text_offset;
> +   hole_max = hole_min + loongarch_mem.image_size;
> +   hole = locate_hole(info, loongarch_mem.image_size,
> +   MiB(1), hole_min, hole_max, 1);
>
> if (hole == ULONG_MAX)
> dbgprintf("%s: locate_hole failed\n", __func__);
> --
> 2.42.0
>
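For reference, locate_hole() is declared in kexec-tools' kexec/kexec.h
roughly as below; the fix pins the search window so the only possible
placement is the link address (the trailing flag selects which end of
the window to prefer):

  unsigned long locate_hole(struct kexec_info *info,
                            unsigned long hole_size,
                            unsigned long hole_align,
                            unsigned long hole_min,
                            unsigned long hole_max,
                            int hole_end);

With hole_min = text_offset and hole_max = text_offset + image_size,
any hole large enough for the image can only start at text_offset.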



Re: [PATCH v2] LoongArch: Load vmlinux.efi to the link address

2023-11-27 Thread WANG Rui
Hi,

On Mon, Nov 27, 2023 at 10:36 AM RuiRui Yang  wrote:
>
> On Mon, 27 Nov 2023 at 09:53, RuiRui Yang  wrote:
> >
> > On Sat, 25 Nov 2023 at 14:54, WANG Rui  wrote:
> > >
> > > Currently, kexec loads vmlinux.efi to address 0 instead of the link
> > > address. This causes kexec to fail to boot the new vmlinux.efi on qemu.
> > >
> > >   pei_loongarch_load: kernel_segment: 
> > >   pei_loongarch_load: kernel_entry:   013f1000
> > >   pei_loongarch_load: image_size: 01ca
> > >   pei_loongarch_load: text_offset:0020
> > >   pei_loongarch_load: phys_offset:
> > >   pei_loongarch_load: PE format:  yes
> > >   loongarch_load_other_segments:333: command_line: kexec console=ttyS0,115200
> > >   kexec_load: entry = 0x13f1000 flags = 0x102
> > >   nr_segments = 2
> > >   segment[0].buf   = 0x7fffeea38010
> > >   segment[0].bufsz = 0x1b55200
> > >   segment[0].mem   = (nil)
> > >   segment[0].memsz = 0x1ca
> > >   segment[1].buf   = 0x570940b0
> > >   segment[1].bufsz = 0x200
> > >   segment[1].mem   = 0x1ca
> > >   segment[1].memsz = 0x4000
> > >
> > > This patch constrains the range of the kernel segment by `hole_min`
> > > and `hole_max` to place vmlinux.efi exactly at the link address.
> > >
> > >   pei_loongarch_load: kernel_segment: 0020
> > >   pei_loongarch_load: kernel_entry:   013f1000
> > >   pei_loongarch_load: image_size: 01ca
> > >   pei_loongarch_load: text_offset:0020
> > >   pei_loongarch_load: phys_offset:
> > >   pei_loongarch_load: PE format:  yes
> > >   loongarch_load_other_segments:339: command_line: kexec console=ttyS0,115200
> > >   kexec_load: entry = 0x13f1000 flags = 0x102
> > >   nr_segments = 2
> > >   segment[0].buf   = 0x72028010
> > >   segment[0].bufsz = 0x1b55200
> > >   segment[0].mem   = 0x20
> > >   segment[0].memsz = 0x1ca
> > >   segment[1].buf   = 0x57498098
> > >   segment[1].bufsz = 0x200
> > >   segment[1].mem   = 0x1ea
> > >   segment[1].memsz = 0x4000
> > >
> > > Signed-off-by: WANG Rui 
> > > ---
> > >
> > > v1->v2:
> > >  * Fix the issue preventing it from working on the physical machine.
> > >
> > >  kexec/arch/loongarch/kexec-loongarch.c | 10 +++---
> > >  1 file changed, 7 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/kexec/arch/loongarch/kexec-loongarch.c b/kexec/arch/loongarch/kexec-loongarch.c
> > > index 62ff8fd..32a42d2 100644
> > > --- a/kexec/arch/loongarch/kexec-loongarch.c
> > > +++ b/kexec/arch/loongarch/kexec-loongarch.c
> > > @@ -265,9 +265,13 @@ unsigned long loongarch_locate_kernel_segment(struct kexec_info *info)
> > > hole = ULONG_MAX;
> > > }
> > > } else {
> > > -   hole = locate_hole(info,
> > > -   loongarch_mem.text_offset + loongarch_mem.image_size,
> > > -   MiB(1), 0, ULONG_MAX, 1);
> > > +   unsigned long hole_min;
> > > +   unsigned long hole_max;
> > > +
> > > +   hole_min = loongarch_mem.text_offset;
> > > +   hole_max = hole_min + loongarch_mem.image_size;
> > > +   hole = locate_hole(info, loongarch_mem.image_size,
> > > +   MiB(1), hole_min, hole_max, 1);
> > >
> > > if (hole == ULONG_MAX)
> > > dbgprintf("%s: locate_hole failed\n", __func__);
> >
> > Hi,
> >
> > Previously, when I played with the zboot kernel on a KVM guest, I
> > noticed this issue. I found that the 1st 2M of memory is memblock
> > reserved but is not shown in /proc/iomem as reserved; I suspect the
> > 1st 2M is not usable for some arch-specific reason, but I was not
> > sure.  The patch below can fix it, but due to my rusty knowledge of
> > loongarch I
>
> To correct my English wording a bit: I meant rusty knowledge of kexec
> details and newbie-level loongarch knowledge. BTW, the webmail often
> randomly chooses the sender email; I usually use another email for
> community work, that is Dave Young ,
> same person ;)
>
> Anyway, since this is loongarch-specific, it would be better to leave
> it to you guys, the arch people, to see how best to fix it.
>
> > did not send it out. I suspect that even if locate_hole avoids the
> > wrong memory, the 2nd kernel could still access it.  Correct?

I can confirm that the mapping of the 1st 2M in iomem on qemu causes
kexec to not work. The root cause is that LoongArch's vmlinux.efi can
only run at the link address, which is why I limit where the kernel
segment is allocated via hole_min/hole_max rather than via the kernel's
iomem.

Huacai, what do you think about the 1st 2M mapping type in the kernel?

> >
> > Index: linux/arch/loongarch/kernel/mem.c
> > ===
> > --- linux.orig/arch/loongarch/kernel/mem.c  2023-06-02
> > 

[PATCH v2] drivers/base/cpu: crash data showing should depends on KEXEC_CORE

2023-11-27 Thread Baoquan He
After commit 88a6f8994421 ("crash: memory and CPU hotplug sysfs
attributes"), on x86_64, if only the kdump-related kernel configs below
are set, a compile error is triggered.


CONFIG_CRASH_CORE=y
CONFIG_KEXEC_CORE=y
CONFIG_CRASH_DUMP=y
CONFIG_CRASH_HOTPLUG=y
--

--
drivers/base/cpu.c: In function ‘crash_hotplug_show’:
drivers/base/cpu.c:309:40: error: implicit declaration of function ‘crash_hotplug_cpu_support’; did you mean ‘crash_hotplug_show’? [-Werror=implicit-function-declaration]
  309 | return sysfs_emit(buf, "%d\n", crash_hotplug_cpu_support());
  |^
  |crash_hotplug_show
cc1: some warnings being treated as errors
--

CONFIG_KEXEC is used to enable the kexec_load interface, so making the
display of crash_notes/crash_notes_size/crash_hotplug depend on
CONFIG_KEXEC is incorrect. It should depend on KEXEC_CORE instead.

Fix it now.

Fixes: 88a6f8994421 ("crash: memory and CPU hotplug sysfs attributes")
Signed-off-by: Baoquan He 
---
 drivers/base/cpu.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
index 9ea22e165acd..548491de818e 100644
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -144,7 +144,7 @@ static DEVICE_ATTR(release, S_IWUSR, NULL, cpu_release_store);
 #endif /* CONFIG_ARCH_CPU_PROBE_RELEASE */
 #endif /* CONFIG_HOTPLUG_CPU */
 
-#ifdef CONFIG_KEXEC
+#ifdef CONFIG_KEXEC_CORE
 #include <linux/kexec.h>
 
 static ssize_t crash_notes_show(struct device *dev,
@@ -189,14 +189,14 @@ static const struct attribute_group crash_note_cpu_attr_group = {
 #endif
 
 static const struct attribute_group *common_cpu_attr_groups[] = {
-#ifdef CONFIG_KEXEC
+#ifdef CONFIG_KEXEC_CORE
	&crash_note_cpu_attr_group,
 #endif
NULL
 };
 
 static const struct attribute_group *hotplugable_cpu_attr_groups[] = {
-#ifdef CONFIG_KEXEC
+#ifdef CONFIG_KEXEC_CORE
	&crash_note_cpu_attr_group,
 #endif
NULL
-- 
2.41.0




[PATCH v2] kernel/Kconfig.kexec: drop select of KEXEC for CRASH_DUMP

2023-11-27 Thread Baoquan He
Ignat Korchagin complained that a potential config regression was
introduced by commit 89cde455915f ("kexec: consolidate kexec and
crash options into kernel/Kconfig.kexec"). Before the commit,
CONFIG_CRASH_DUMP had no dependency on CONFIG_KEXEC. After the commit,
CRASH_DUMP selects KEXEC. That forces the system to have CONFIG_KEXEC=y
as long as CONFIG_CRASH_DUMP=y, which people may not want.

In Ignat's case, he sets CONFIG_CRASH_DUMP=y, CONFIG_KEXEC_FILE=y and
CONFIG_KEXEC=n because the kexec_load interface can be a security issue
if the kernel/initrd cannot be signed and verified.

CRASH_DUMP selects KEXEC because Eric, the author of the above commit,
got an LKP report of a build failure when posting an earlier version of
the patch. Please see the link below for details of the LKP report:


https://lore.kernel.org/all/3e8eecd1-a277-2cfb-690e-5de2eb7b9...@oracle.com/T/#u

In fact, that LKP report is triggered because arm's <asm/kexec.h> is
wrapped in a CONFIG_KEXEC ifdeffery scope. That is wrong: CONFIG_KEXEC
controls the enabling/disabling of the kexec_load interface, not the
kexec feature itself. Removing the wrongly added CONFIG_KEXEC ifdeffery
from arm's <asm/kexec.h> allows us to drop the select of KEXEC for
CRASH_DUMP. Meanwhile, change arch/arm/kernel/Makefile to make
machine_kexec.o and relocate_kernel.o depend on KEXEC_CORE.

Fixes: 89cde455915f ("kexec: consolidate kexec and crash options into kernel/Kconfig.kexec")
Reported-by: Ignat Korchagin 
Signed-off-by: Baoquan He 
---
 arch/arm/include/asm/kexec.h | 4 
 arch/arm/kernel/Makefile | 2 +-
 kernel/Kconfig.kexec | 1 -
 3 files changed, 1 insertion(+), 6 deletions(-)

diff --git a/arch/arm/include/asm/kexec.h b/arch/arm/include/asm/kexec.h
index e62832dcba76..a8287e7ab9d4 100644
--- a/arch/arm/include/asm/kexec.h
+++ b/arch/arm/include/asm/kexec.h
@@ -2,8 +2,6 @@
 #ifndef _ARM_KEXEC_H
 #define _ARM_KEXEC_H
 
-#ifdef CONFIG_KEXEC
-
 /* Maximum physical address we can use pages from */
 #define KEXEC_SOURCE_MEMORY_LIMIT (-1UL)
 /* Maximum address we can reach in physical address mode */
@@ -82,6 +80,4 @@ static inline struct page *boot_pfn_to_page(unsigned long boot_pfn)
 
 #endif /* __ASSEMBLY__ */
 
-#endif /* CONFIG_KEXEC */
-
 #endif /* _ARM_KEXEC_H */
diff --git a/arch/arm/kernel/Makefile b/arch/arm/kernel/Makefile
index d53f56d6f840..771264d4726a 100644
--- a/arch/arm/kernel/Makefile
+++ b/arch/arm/kernel/Makefile
@@ -59,7 +59,7 @@ obj-$(CONFIG_FUNCTION_TRACER) += entry-ftrace.o
 obj-$(CONFIG_DYNAMIC_FTRACE)   += ftrace.o insn.o patch.o
 obj-$(CONFIG_FUNCTION_GRAPH_TRACER)+= ftrace.o insn.o patch.o
 obj-$(CONFIG_JUMP_LABEL)   += jump_label.o insn.o patch.o
-obj-$(CONFIG_KEXEC)+= machine_kexec.o relocate_kernel.o
+obj-$(CONFIG_KEXEC_CORE)   += machine_kexec.o relocate_kernel.o
 # Main staffs in KPROBES are in arch/arm/probes/ .
 obj-$(CONFIG_KPROBES)  += patch.o insn.o
 obj-$(CONFIG_OABI_COMPAT)  += sys_oabi-compat.o
diff --git a/kernel/Kconfig.kexec b/kernel/Kconfig.kexec
index 7aff28ded2f4..1cc3b1c595d7 100644
--- a/kernel/Kconfig.kexec
+++ b/kernel/Kconfig.kexec
@@ -97,7 +97,6 @@ config CRASH_DUMP
depends on ARCH_SUPPORTS_KEXEC
select CRASH_CORE
select KEXEC_CORE
-   select KEXEC
help
  Generate crash dump after being started by kexec.
  This should be normally only set in special crash dump kernels
-- 
2.41.0




Re: [PATCH] crash_core: export vmemmap when CONFIG_SPARSEMEM_VMEMMAP is enabled

2023-11-27 Thread Shijie Huang


On 2023/11/28 11:25, Baoquan He wrote:
> On 11/27/23 at 11:18am, Shijie Huang wrote:
> > On 2023/11/27 10:51, Baoquan He wrote:
> > > Hi,
> > >
> > > On 11/27/23 at 10:07am, Huang Shijie wrote:
> > > > In memory_model.h, if CONFIG_SPARSEMEM_VMEMMAP is configured,
> > > > the kernel will use vmemmap to do __pfn_to_page/page_to_pfn,
> > > > and will not use the "classic sparse" method for
> > > > __pfn_to_page/page_to_pfn.
> > > >
> > > > So export vmemmap when CONFIG_SPARSEMEM_VMEMMAP is configured.
> > > > This lets user applications (crash, etc.) get faster
> > > > pfn_to_page/page_to_pfn operations too.
> > > Are there Crash or makedumpfile patches posted yet to make use of this?
> > I have patches for Crash to use the 'vmemmap'; I will send them out
> > after this patch is merged.
> >
> > (I think Kazu will not merge a crash patch that depends on an
> > unmerged kernel patch.)
> Maybe post those userspace patches too so that Kazu can evaluate
> whether the improvement is worthwhile?

No problem. I will send them out later.

Thanks

Huang Shijie




Re: [PATCH] crash_core: export vmemmap when CONFIG_SPARSEMEM_VMEMMAP is enabled

2023-11-27 Thread Baoquan He
On 11/27/23 at 11:18am, Shijie Huang wrote:
> 
> On 2023/11/27 10:51, Baoquan He wrote:
> > Hi,
> > 
> > On 11/27/23 at 10:07am, Huang Shijie wrote:
> > > In memory_model.h, if CONFIG_SPARSEMEM_VMEMMAP is configured,
> > > the kernel will use vmemmap to do __pfn_to_page/page_to_pfn,
> > > and will not use the "classic sparse" method for
> > > __pfn_to_page/page_to_pfn.
> > > 
> > > So export vmemmap when CONFIG_SPARSEMEM_VMEMMAP is configured.
> > > This lets user applications (crash, etc.) get faster
> > > pfn_to_page/page_to_pfn operations too.
> > Are there Crash or makedumpfile patches posted yet to make use of this?
> 
> I have patches for Crash to use the 'vmemmap'; I will send them out
> after this patch is merged.
> 
> (I think Kazu will not merge a crash patch that depends on an
> unmerged kernel patch.)

Maybe post those userspace patches too so that Kazu can evaluate
whether the improvement is worthwhile?




Re: [PATCH 0/4] kdump: crashkernel reservation from CMA

2023-11-27 Thread Baoquan He
On 11/28/23 at 09:12am, Tao Liu wrote:
> Hi Jiri,
> 
> On Sun, Nov 26, 2023 at 5:22 AM Jiri Bohac  wrote:
> >
> > Hi Tao,
> >
> > On Sat, Nov 25, 2023 at 09:51:54AM +0800, Tao Liu wrote:
> > > Thanks for the idea of using CMA as part of memory for the 2nd kernel.
> > > However I have a question:
> > >
> > > What if there is on-going DMA/RDMA access on the CMA range when 1st
> > > kernel crash? There might be data corruption when 2nd kernel and
> > > DMA/RDMA write to the same place, how to address such an issue?
> >
> > The crash kernel CMA area(s) registered via
> > cma_declare_contiguous() are distinct from the
> > dma_contiguous_default_area or device-specific CMA areas that
> > dma_alloc_contiguous() would use to reserve memory for DMA.
> >
> > Kernel pages will not be allocated from the crash kernel CMA
> > area(s), because they are not GFP_MOVABLE. The CMA area will only
> > be used for user pages.
> >
> > User pages for RDMA, should be pinned with FOLL_LONGTERM and that
> > would migrate them away from the CMA area.
> >
> > But you're right that DMA to user pages pinned without
> > FOLL_LONGTERM would still be possible. Would this be a problem in
> > practice? Do you see any way around it?
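For reference, the long-term pin path Jiri describes looks roughly like
this from a driver's point of view (a sketch under stated assumptions,
not code from this patch set):

  #include <linux/mm.h>

  /* Sketch: long-term pinning of user pages.  With FOLL_LONGTERM the
   * GUP code migrates the pages out of CMA/ZONE_MOVABLE before
   * pinning them, so such pins cannot keep the crash-kernel CMA area
   * busy. */
  static int pin_buffer_longterm(unsigned long uaddr, int nr_pages,
                                 struct page **pages)
  {
          return pin_user_pages_fast(uaddr, nr_pages,
                                     FOLL_WRITE | FOLL_LONGTERM, pages);
  }

Short-lived pins taken without FOLL_LONGTERM (e.g. for in-flight
O_DIRECT I/O) are the open case discussed below.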

Thanks for the effort to bring this up, Jiri.

I am wondering how you will use this crashkernel=,cma parameter, i.e.
what the intended scenario is. I am asking because I don't know how
SUSE deploys kdump in its distros: is the kdump kernel's initramfs the
same as the 1st kernel's, or does it only contain the kernel modules
needed for the required devices? E.g. if we dump to a local disk, will
the NIC driver be filtered out? In the latter case, the in-flight DMA
issue is possible, e.g. the NIC has a DMA buffer in the CMA area but is
not reset during kdump boot-up because the NIC driver is not loaded to
initialize it. Not sure if this is 100% certain, but it seems possible
in theory?

Recently we have been seeing an issue on an HPE system where PCI error
messages are always seen in the kdump kernel, even though it is a local
dump, the NIC device is not needed, and the igb driver is not loaded.
Adding the igb driver into the kdump initramfs works around it. That is
similar to the in-flight DMA issue above.

The crashkernel=,cma approach requires that no userspace data be
dumped, but from our support engineers' feedback, customers never say
they don't need to dump userspace data. Assume a server with a huge
database deployed; the database has collapsed often recently, and the
database provider claims it's not the database's fault, so the OS needs
to prove its innocence. What will you do?

So this looks like a nice-to-have to me. At least for fedora/rhel's
usage, we may only back-port this patch and add one sentence to our
user guide saying "there's a new crashkernel=,cma option that can be
used with crashkernel= to save memory; please feel free to try it if
you like", unless SUSE or other distros decide to use it as a default
config or something like that. Please correct me if I missed anything
or got anything wrong.

Thanks
Baoquan




Re: [PATCH 0/4] kdump: crashkernel reservation from CMA

2023-11-27 Thread Pingfan Liu
On Sun, Nov 26, 2023 at 5:24 AM Jiri Bohac  wrote:
>
> Hi Tao,
>
> On Sat, Nov 25, 2023 at 09:51:54AM +0800, Tao Liu wrote:
> > Thanks for the idea of using CMA as part of memory for the 2nd kernel.
> > However I have a question:
> >
> > What if there is on-going DMA/RDMA access on the CMA range when 1st
> > kernel crash? There might be data corruption when 2nd kernel and
> > DMA/RDMA write to the same place, how to address such an issue?
>
> The crash kernel CMA area(s) registered via
> cma_declare_contiguous() are distinct from the
> dma_contiguous_default_area or device-specific CMA areas that
> dma_alloc_contiguous() would use to reserve memory for DMA.
>
> Kernel pages will not be allocated from the crash kernel CMA
> area(s), because they are not GFP_MOVABLE. The CMA area will only
> be used for user pages.
>
> User pages for RDMA, should be pinned with FOLL_LONGTERM and that
> would migrate them away from the CMA area.
>
> But you're right that DMA to user pages pinned without
> FOLL_LONGTERM would still be possible. Would this be a problem in
> practice? Do you see any way around it?
>

I don't have a real case in mind, but this problem has kept us from
using the CMA area in kdump for years. Most importantly, this method
would introduce a bug that is hard to track down.

As a way around it, maybe you could introduce a specific zone and, on
any GUP, migrate the pages away. I have doubts about whether this
approach is worthwhile, considering the trade-off between benefits and
complexity.

Thanks,

Pingfan




Re: [PATCH 0/4] kdump: crashkernel reservation from CMA

2023-11-27 Thread Tao Liu
Hi Jiri,

On Sun, Nov 26, 2023 at 5:22 AM Jiri Bohac  wrote:
>
> Hi Tao,
>
> On Sat, Nov 25, 2023 at 09:51:54AM +0800, Tao Liu wrote:
> > Thanks for the idea of using CMA as part of memory for the 2nd kernel.
> > However I have a question:
> >
> > What if there is on-going DMA/RDMA access on the CMA range when 1st
> > kernel crash? There might be data corruption when 2nd kernel and
> > DMA/RDMA write to the same place, how to address such an issue?
>
> The crash kernel CMA area(s) registered via
> cma_declare_contiguous() are distinct from the
> dma_contiguous_default_area or device-specific CMA areas that
> dma_alloc_contiguous() would use to reserve memory for DMA.
>
> Kernel pages will not be allocated from the crash kernel CMA
> area(s), because they are not GFP_MOVABLE. The CMA area will only
> be used for user pages.
>
> User pages for RDMA, should be pinned with FOLL_LONGTERM and that
> would migrate them away from the CMA area.
>
> But you're right that DMA to user pages pinned without
> FOLL_LONGTERM would still be possible. Would this be a problem in
> practice? Do you see any way around it?
>

Thanks for the explanation! Sorry I don't have any ideas so far...

@Pingfan Liu @Baoquan He Hi, do you have any suggestions for it?

Thanks,
Tao Liu

> Thanks,
>
> --
> Jiri Bohac 
> SUSE Labs, Prague, Czechia
>




Re: [PATCH] kexec_file: add kexec_file flag to support debug printing

2023-11-27 Thread Baoquan He
On 11/27/23 at 01:32pm, Simon Horman wrote:
> On Tue, Nov 14, 2023 at 11:20:30PM +0800, Baoquan He wrote:
> > This adds KEXEC_FILE_DEBUG to kexec_file_flags so that it can be passed
> > to the kernel when '-d' is specified with the kexec_file_load interface.
> > With that flag set, the kernel can enable debug message printing.
> > 
> > Signed-off-by: Baoquan He 
> 
> Thanks,
> 
> this looks fine to me.
> But perhaps the corresponding Kernel patch should land first?
> Let me know what you think.

Yes, agreed. I will ping you once the kernel patches are taken. Thanks.
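For illustration, the userspace side amounts to ORing one more bit into
the flags passed to the syscall; the flag value below is an assumption
for the sketch, not taken from the patch:

  /* Hypothetical sketch of kexec-tools passing the new flag. */
  #define KEXEC_FILE_DEBUG  0x00000008   /* assumed bit value */

  if (debug_requested)            /* '-d' on the command line */
          kexec_file_flags |= KEXEC_FILE_DEBUG;

  result = kexec_file_load(kernel_fd, initrd_fd,
                           cmdline_len, cmdline, kexec_file_flags);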




Re: [RFC V2] IMA Log Snapshotting Design Proposal

2023-11-27 Thread Paul Moore
On Mon, Nov 27, 2023 at 12:08 PM Mimi Zohar  wrote:
> On Wed, 2023-11-22 at 09:22 -0500, Paul Moore wrote:

...

> > Okay, we are starting to get closer, but I'm still missing the part
> > where you say "if you do X, Y, and Z, I'll accept and merge the
> > solution."  Can you be more explicit about what approach(es) you would
> > be willing to accept upstream?
>
> Included with what is wanted/needed is an explanation as to my concerns
> with the existing proposal.
>
> First we need to differentiate between kernel and userspace
> requirements.  (The "snapshotting" design proposal intermixes them.)
>
> From the kernel perspective, the Log Snapshotting Design proposal "B.1
> Goals" is very nice, but once the measurement list can be trimmed it is
> really irrelevant.  Userspace can do whatever it wants with the
> measurement list records.  So instead of paying lip service to what
> should be done, just call it as it is - trimming the measurement list.

Fair enough.  I personally think it is nice to have a brief discussion
of how userspace might use a kernel feature, but if you prefer to drop
that part of the design doc I doubt anyone will object very strongly.

> ---
> | B.1 Goals   |
> ---
> To address the issues described in the section above, we propose
> enhancements to the IMA subsystem to achieve the following goals:
>
>   a. Reduce memory pressure on the Kernel caused by larger in-memory
>  IMA logs.
>
>   b. Preserve the system's ability to get remotely attested using the
>  IMA log, even after implementing the enhancements to reduce memory
>  pressure caused by the IMA log. IMA's Integrity guarantees should
>  be maintained.
>
>   c. Provide mechanisms from Kernel side to the remote attestation
>  service to make service-side processing more efficient.

That looks fine to me.

> From the kernel perspective there needs to be a method of trimming N
> records from the head of the measurement list.  In addition to the
> existing securityfs "runtime measurement list", defining a new
> securityfs file containing the current count of in-memory measurement
> records would be beneficial.

I imagine that should be trivial to implement and I can't imagine
there being any objection to that.

If we are going to have a record count, I imagine it would also be
helpful to maintain a securityfs file with the total size (in bytes)
of the in-memory measurement log.  In fact, I suspect this will
probably be more useful for those who wish to manage the size of the
measurement log.
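A minimal sketch of such a securityfs entry, with hypothetical names
(ima_binary_log_size and friends are illustrative, not existing kernel
symbols):

  static ssize_t ima_log_size_read(struct file *file, char __user *buf,
                                   size_t count, loff_t *ppos)
  {
          char tmp[32];
          /* ima_binary_log_size: hypothetical counter updated as
           * records are added to or trimmed from the list. */
          int len = scnprintf(tmp, sizeof(tmp), "%lu\n",
                              atomic_long_read(&ima_binary_log_size));

          return simple_read_from_buffer(buf, count, ppos, tmp, len);
  }

  static const struct file_operations ima_log_size_ops = {
          .read = ima_log_size_read,
  };

  /* registered next to the existing ima/ entries, e.g.:
   * securityfs_create_file("binary_runtime_measurements_size", 0440,
   *                        ima_dir, NULL, &ima_log_size_ops);
   */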

> Defining other IMA securityfs files like
> how many times the measurement list has been trimmed might be
> beneficial as well.

I have no objection to that.  Would a total record count, i.e. a value
that doesn't reset on a snapshot event, be more useful here?

> Of course properly document the integrity
> implications and repercussions of the new Kconfig that allows trimming
> the measurement list.

Of course.

> Defining a simple "trim" marker measurement record would be a visual
> indication that the measurement list has been trimmed.  I might even
> have compared it to the "boot_aggregate".  However, the proposed marker
> based on TPM PCRs requires pausing extending the measurement list.

...

> Before defining a new critical-data record, we need to decide whether
> it is really necessary or if it is redundant.  If we define a new
> "critical-data" record, can it be defined such that it doesn't require
> pausing extending the measurement list?  For example, a new simple
> visual critical-data record could contain the number of records (e.g.
> /ima/runtime_measurements_count) up to that point.

What if the snapshot_aggregate was a hash of the measurement log
starting with either the boot_aggregate or the latest
snapshot_aggregate and ending on the record before the new
snapshot_aggregate?  The performance impact at snapshot time should be
minimal as the hash can be incrementally updated as new records are
added to the measurement list.  While the hash wouldn't capture the
TPM state, it would allow some crude verification when reassembling
the log.  If one could bear the cost of a TPM signing operation, the
log digest could be signed by the TPM.
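A sketch of that incremental scheme using the kernel crypto API (all
names are hypothetical; this is one way it could look, not the
proposal's code):

  #include <crypto/hash.h>

  static struct shash_desc *snap_desc;   /* running digest state */

  /* Fold each record into the running hash as it is appended, so the
   * snapshot-time cost is a single finalization. */
  static int snapshot_hash_record(const void *rec, size_t len)
  {
          return crypto_shash_update(snap_desc, rec, len);
  }

  /* At snapshot time: emit the aggregate and restart the chain so the
   * next aggregate covers records from this point on. */
  static int snapshot_hash_final(u8 *digest)
  {
          int err = crypto_shash_final(snap_desc, digest);

          if (!err)
                  err = crypto_shash_init(snap_desc);
          return err;
  }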

> The new critical-data record and trimming the measurement list should
> be disjoint features.  If the first record after trimming the
> measurement list should be the critical-data record, then trim the
> measurement list up to that point.

I disagree about the snapshot_aggregate record being disjoint from the
measurement log, but I suspect Tushar and Sush are willing to forgo
the snapshot_aggregate if that is a blocker from your perspective.
Once again, the main goal is the ability to manage the size of the
measurement log; while having a snapshot_aggregate that can be used to

Re: [RFC V2] IMA Log Snapshotting Design Proposal

2023-11-27 Thread Mimi Zohar
On Wed, 2023-11-22 at 09:22 -0500, Paul Moore wrote:
> On Wed, Nov 22, 2023 at 8:18 AM Mimi Zohar  wrote:
> > On Tue, 2023-11-21 at 23:27 -0500, Paul Moore wrote:
> > > On Thu, Nov 16, 2023 at 5:28 PM Paul Moore  wrote:
> > > > On Tue, Oct 31, 2023 at 3:15 PM Mimi Zohar  wrote:
> > >
> > > ...
> > >
> > > > > Userspace can already export the IMA measurement list(s) via the
> > > > > securityfs {ascii,binary}_runtime_measurements file(s) and do whatever
> > > > > it wants with it.  All that is missing in the kernel is the ability to
> > > > > trim the measurement list, which doesn't seem all that complicated.
> > > >
> > > > From my perspective what has been presented is basically just trimming
> > > > the in-memory measurement log, the additional complexity (which really
> > > > doesn't look that bad IMO) is there to ensure robustness in the face
> > > > of an unreliable userspace (processes die, get killed, etc.) and to
> > > > establish a new, transitive root of trust in the newly trimmed
> > > > in-memory log.
> > > >
> > > > I suppose one could simplify things greatly by having a design where
> > > > userspace  captures the measurement log and then writes the number of
> > > > measurement records to trim from the start of the measurement log to a
> > > > sysfs file and the kernel acts on that.  You could do this with, or
> > > > without, the snapshot_aggregate entry concept; in fact that could be
> > > > something that was controlled by userspace, e.g. write the number of
> > > > lines and a flag to indicate if a snapshot_aggregate was desired to
> > > > the sysfs file.  I can't say I've thought it all the way through to
> > > > make sure there are no gotchas, but I'm guessing that is about as
> > > > simple as one can get.
> >
> > > > If there is something else you had in mind, Mimi, please share the
> > > > details.  This is a very real problem we are facing and we want to
> > > > work to get a solution upstream.
> > >
> > > Any thoughts on this Mimi?  We have a real interest in working with
> > > you to solve this problem upstream, but we need more detailed feedback
> > > than "too complicated".  If you don't like the solutions presented
> > > thus far, what type of solution would you like to see?
> >
> > Paul, the design copies the measurement list to a temporary "snapshot"
> > file before trimming the measurement list, which, according to the
> > design document, locks the existing measurement list, and further
> > pauses extending the measurement list to calculate the
> > "snapshot_aggregate".
> 
> I believe the intent is to only pause the measurements while the
> snapshot_aggregate is generated, not for the duration of the entire
> snapshot process.  The purpose of the snapshot_aggregate is to
> establish a new root of trust, similar to the boot_aggregate, to help
> improve attestation performance.
> 
> > Userspace can export the measurement list already, so why this
> > complicated design?
> 
> The current code has no provision for trimming the measurement log,
> that's the primary reason.
> 
> > As I mentioned previously and repeated yesterday, the
> > "snapshot_aggregate" is a new type of critical data and should be
> > upstreamed independently of this patch set that trims the measurement
> > list.  Trimming the measurement list could be based, as you suggested
> > on the number of records to remove, or it could be up to the next/last
> > "snapshot_aggregate" record.
> 
> Okay, we are starting to get closer, but I'm still missing the part
> where you say "if you do X, Y, and Z, I'll accept and merge the
> solution."  Can you be more explicit about what approach(es) you would
> be willing to accept upstream?

Included with what is wanted/needed is an explanation as to my concerns
with the existing proposal.

First we need to differentiate between kernel and userspace
requirements.  (The "snapshotting" design proposal intermixes them.)

From the kernel perspective, the Log Snapshotting Design proposal "B.1
Goals" is very nice, but once the measurement list can be trimmed it is
really irrelevant.  Userspace can do whatever it wants with the
measurement list records.  So instead of paying lip service to what
should be done, just call it as it is - trimming the measurement list.

---
| B.1 Goals   |
---
To address the issues described in the section above, we propose
enhancements to the IMA subsystem to achieve the following goals:

  a. Reduce memory pressure on the Kernel caused by larger in-memory
 IMA logs.

  b. Preserve the system's ability to get remotely attested using the
 IMA log, even after implementing the enhancements to reduce memory
 pressure caused by the IMA log. IMA's Integrity guarantees should
 be maintained.

  c. Provide mechanisms from Kernel side to the remote attestation
 service to make service-side processing more efficient.

Re: [PATCH] LoongArch: Fix an issue with relocatable vmlinux

2023-11-27 Thread Simon Horman
On Fri, Nov 24, 2023 at 04:54:10PM +0800, WANG Rui wrote:
> Normally, vmlinux for LoongArch is of ET_EXEC type, but if built with
> CONFIG_RELOCATABLE (i.e. PIE) and Clang, it will be of ET_DYN type.
> Meanwhile, the physical address field of the segments in vmlinux
> actually has the same value as the virtual address field.
> 
> Similar to arm64, this patch allows the check to be skipped
> unconditionally on LoongArch.
> 
> Link: https://github.com/ClangBuiltLinux/linux/issues/1963
> Signed-off-by: WANG Rui 

Thanks, applied.



Re: [PATCH] m68k: fix getrandom() use with uclibc

2023-11-27 Thread Simon Horman
On Sat, Apr 22, 2023 at 11:59:04AM +0200, Laurent Vivier wrote:
> With uclibc, getrandom() is only defined with _GNU_SOURCE, fix that:
> 
> kexec/arch/m68k/bootinfo.c: In function 'bootinfo_add_rng_seed':
> kexec/arch/m68k/bootinfo.c:231:13: warning: implicit declaration of function 'getrandom'; did you mean 'srandom'? [-Wimplicit-function-declaration]
>   231 | if (getrandom(bi->rng_seed.data, RNG_SEED_LEN, GRND_NONBLOCK) != RNG_SEED_LEN) {
>   | ^
>   | srandom
> kexec/arch/m68k/bootinfo.c:231:56: error: 'GRND_NONBLOCK' undeclared (first use in this function)
>   231 | if (getrandom(bi->rng_seed.data, RNG_SEED_LEN, GRND_NONBLOCK) != RNG_SEED_LEN) {
>   |^
> 
> Fixes: b9de05184816 ("m68k: pass rng seed via BI_RNG_SEED")
> Cc: ja...@zx2c4.com
> Signed-off-by: Laurent Vivier 

Thanks, applied.
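For context, the fix boils down to requesting the GNU feature set
before any libc header is included; a sketch of the relevant lines:

  /* uclibc only declares getrandom() and GRND_NONBLOCK when
   * _GNU_SOURCE is defined before the includes. */
  #define _GNU_SOURCE
  #include <sys/random.h>

  /* ... getrandom(bi->rng_seed.data, RNG_SEED_LEN, GRND_NONBLOCK) ... */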



Re: [PATCH] lzma: Relax memory limit for lzma decompressor

2023-11-27 Thread Simon Horman
On Sat, Nov 25, 2023 at 03:26:43PM +0800, WANG Rui wrote:
> kexec cannot load an LZMA-compressed vmlinuz.efi on LoongArch.
> 
>   Try LZMA decompression.
>   lzma_decompress_file: read on /tmp/Image4yyfhM of 65536 bytes failed
>   pez_prepare: decompressed size 8563960
>   pez_prepare: done
>   Cannot load vmlinuz.efi
> 
> The root cause is that the lzma decompressor requires more memory,
> which exceeds the current 64M limit.
> 
> Reported-by: Huacai Chen 
> Signed-off-by: WANG Rui 

Thanks, applied.
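For context, liblzma decoders take an explicit memory-usage cap; the
sketch below shows the knob being relaxed (the exact call site and
values in kexec-tools are paraphrased, not quoted):

  #include <stdint.h>
  #include <lzma.h>

  lzma_stream strm = LZMA_STREAM_INIT;

  /* Old behaviour: reject input whose dictionary needs more than the
   * cap, e.g. a 64M limit: lzma_auto_decoder(&strm, 64UL << 20, 0); */

  /* Relaxed: effectively no limit, letting large-dictionary images
   * such as LoongArch's vmlinuz.efi decompress. */
  lzma_ret ret = lzma_auto_decoder(&strm, UINT64_MAX, 0);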



Re: [PATCH] LoongArch: Load vmlinux.efi to the link address

2023-11-27 Thread Simon Horman
On Fri, Nov 24, 2023 at 11:46:58PM +0800, WANG Rui wrote:
> Currently, kexec loads vmlinux.efi to address 0 instead of the link
> address. This causes kexec to fail to boot the new vmlinux.efi on qemu.
> 
>   pei_loongarch_load: kernel_segment: 
>   pei_loongarch_load: kernel_entry:   015dc000
>   pei_loongarch_load: image_size: 01f3
>   pei_loongarch_load: text_offset:0020
>   pei_loongarch_load: phys_offset:
>   pei_loongarch_load: PE format:  yes
>   loongarch_load_other_segments:333: command_line: kexec console=ttyS0,115200
>   kexec_load: entry = 0x15dc000 flags = 0x102
>   nr_segments = 2
>   segment[0].buf   = 0x7fffef664010
>   segment[0].bufsz = 0x1de9a00
>   segment[0].mem   = (nil)
>   segment[0].memsz = 0x1f3
>   segment[1].buf   = 0x55e480b0
>   segment[1].bufsz = 0x200
>   segment[1].mem   = 0x1f3
>   segment[1].memsz = 0x4000
> 
> This patch adds `text_offset` when adding the kernel segment, so that
> vmlinux.efi is loaded at the link address.
> 
>   pei_loongarch_load: kernel_segment: 
>   pei_loongarch_load: kernel_entry:   015dc000
>   pei_loongarch_load: image_size: 01f3
>   pei_loongarch_load: text_offset:0020
>   pei_loongarch_load: phys_offset:
>   pei_loongarch_load: PE format:  yes
>   loongarch_load_other_segments:335: command_line: kexec console=ttyS0,115200
>   kexec_load: entry = 0x15dc000 flags = 0x102
>   nr_segments = 2
>   segment[0].buf   = 0x71a04010
>   segment[0].bufsz = 0x1de9a00
>   segment[0].mem   = 0x20
>   segment[0].memsz = 0x1f3
>   segment[1].buf   = 0x55b28098
>   segment[1].bufsz = 0x200
>   segment[1].mem   = 0x213
>   segment[1].memsz = 0x4000
> 
> Signed-off-by: WANG Rui 

Thanks, applied.




Re: [PATCH] kexec_file: add kexec_file flag to support debug printing

2023-11-27 Thread Simon Horman
On Tue, Nov 14, 2023 at 11:20:30PM +0800, Baoquan He wrote:
> This adds KEXEC_FILE_DEBUG to kexec_file_flags so that it can be passed
> to the kernel when '-d' is specified with the kexec_file_load interface.
> With that flag set, the kernel can enable debug message printing.
> 
> Signed-off-by: Baoquan He 

Thanks,

this looks fine to me.
But perhaps the corresponding Kernel patch should land first?
Let me know what you think.
