numa: Register information into Kmemdump

David Hildenbrand Wed, 27 Aug 2025 05:34:23 -0700

On 27.08.25 13:59, Eugen Hristev wrote:



On 8/25/25 16:58, David Hildenbrand wrote:

On 25.08.25 15:36, Eugen Hristev wrote:



On 8/25/25 16:20, David Hildenbrand wrote:


IIRC, kernel/vmcore_info.c is never built as a module, as it also
accesses non-exported symbols.


Hello David,

I am looking again into this, and there are some things which in my
opinion would be difficult to achieve.
For example I looked into my patch #11 , which adds the `runqueues` into
kmemdump.

The runqueues is a variable of `struct rq` which is defined in
kernel/sched/sched.h , which is not supposed to be included outside of
sched.
Now moving all the struct definition outside of sched.h into another
public header would be rather painful and I don't think it's a really
good option (The struct would be needed to compute the sizeof inside
vmcoreinfo). Secondly, it would also imply moving all the nested struct
definitions outside as well. I doubt this is something that we want for
the sched subsys. How the subsys is designed, out of my understanding,
is to keep these internal structs opaque outside of it.


All the kmemdump module needs is a start and a length, correct? So the
only tricky part is getting the length.


I also have in mind the kernel user case. How would a kernel programmer
want to add some kernel structs/info/buffers into kmemdump such that the
dump would contain their data ? Having "KMEMDUMP_VAR(...)" looks simple
enough.


The other way around, why should anybody have a saying in adding their
data to kmemdump? Why do we have that all over the kernel?

Is your mechanism really so special?

A single composer should take care of that, and it's really just start +
len of physical memory areas.

Otherwise maybe the programmer has to write helpers to compute lengths
etc, and stitch them into kmemdump core.
I am not saying it's impossible, but just tiresome perhaps.


In your patch set, how many of these instances did you encounter where
that was a problem?


One could just add a const variable that holds this information, or even
better, a simple helper function to calculate that.

Maybe someone else reading along has a better idea.


This could work, but it requires again adding some code into the
specific subsystem. E.g. struct_rq_get_size()
I am open to ideas , and thank you very much for your thoughts.


Interestingly, runqueues is a percpu variable, which makes me wonder if
what you had would work as intended (maybe it does, not sure).


I would not really need to dump the runqueues. But the crash tool which
I am using for testing, requires it. Without the runqueues it will not
progress further to load the kernel dump.
So I am not really sure what it does with the runqueues, but it works.
Perhaps using crash/gdb more, to actually do something with this data,
would give more insight about its utility.
For me, it is a prerequisite to run crash, and then to be able to
extract the log buffer from the dump.


I have the faint recollection that percpu vars might not be stored in a
single contiguous physical memory area, but maybe my memory is just
wrong, that's why I was raising it.


   From my perspective it's much simpler and cleaner to just add the
kmemdump annotation macro inside the sched/core.c as it's done in my
patch. This macro translates to a noop if kmemdump is not selected.


I really don't like how we are spreading kmemdump all over the kernel,
and adding complexity with __section when really, all we need is a place
to obtain a start and a length.


I understand. The section idea was suggested by Thomas. Initially I was
skeptic, but I like how it turned out.


Yeah, I don't like it. Taste differs ;)

I am in particular unhappy about custom memblock wrappers.

[...]


To have this working outside of printk, it would be required to walk
through all the printk structs/allocations and select the required info.
Is this something that we want to do outside of printk ?


I don't follow, please elaborate.

How is e.g., log_buf_len_get() + log_buf_addr_get() not sufficient,
given that you run your initialization after setup_log_buf() ?


My initial thought was the same. However I got some feedback from Petr
Mladek here :

https://lore.kernel.org/lkml/aBm5QH2p6p9Wxe_M@localhost.localdomain/

Where he explained how to register the structs correctly.
It can be that setup_log_buf is called again at a later time perhaps.


setup_log_buf() is a __init function, so there is only a certain time
frame where it can be called.

In particular, once the buddy is up, memblock allocations are impossible
and it would be deeply flawed to call this function again.

Let's not over-engineer this.

Peter is on CC, so hopefully he can share his thoughts.


Hello David,

I tested out this snippet (on top of my series, so you can see what I
changed):


diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 18ba6c1e174f..7ac4248a00e5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -67,7 +67,6 @@
  #include <linux/wait_api.h>
  #include <linux/workqueue_api.h>
  #include <linux/livepatch_sched.h>
-#include <linux/kmemdump.h>

  #ifdef CONFIG_PREEMPT_DYNAMIC
  # ifdef CONFIG_GENERIC_IRQ_ENTRY
@@ -120,7 +119,12 @@
EXPORT_TRACEPOINT_SYMBOL_GPL(sched_update_nr_running_tp);
  EXPORT_TRACEPOINT_SYMBOL_GPL(sched_compute_energy_tp);

  DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
-KMEMDUMP_VAR_CORE(runqueues, sizeof(runqueues));
+
+size_t runqueues_get_size(void);
+size_t runqueues_get_size(void)
+{
+       return sizeof(runqueues);
+}

  #ifdef CONFIG_SCHED_PROXY_EXEC
  DEFINE_STATIC_KEY_TRUE(__sched_proxy_exec);
diff --git a/kernel/vmcore_info.c b/kernel/vmcore_info.c
index d808c5e67f35..c6dd2d6e96dd 100644
--- a/kernel/vmcore_info.c
+++ b/kernel/vmcore_info.c
@@ -24,6 +24,12 @@
  #include "kallsyms_internal.h"
  #include "kexec_internal.h"

+typedef void* kmemdump_opaque_t;
+
+size_t runqueues_get_size(void);
+
+extern kmemdump_opaque_t runqueues;


I would have tried that through:

struct rq;
extern struct rq runqueues;

But the whole PER_CPU_SHARED_ALIGNED makes this all weird, and likely
not the way we would want to handle that.

  /* vmcoreinfo stuff */
  unsigned char *vmcoreinfo_data;
  size_t vmcoreinfo_size;
@@ -230,6 +236,9 @@ static int __init crash_save_vmcoreinfo_init(void)

         kmemdump_register_id(KMEMDUMP_ID_COREIMAGE_VMCOREINFO,
                              (void *)vmcoreinfo_data, vmcoreinfo_size);
+       kmemdump_register_id(KMEMDUMP_ID_COREIMAGE_runqueues,
+                            (void *)&runqueues, runqueues_get_size());
+
         return 0;
  }

With this, no more .section, no kmemdump code into sched, however, there
are few things :


I would really just do here something like the following:

/**
 * sched_get_runqueues_area - obtain the runqueues area for dumping
 * @start: ...
 * @size: ...
 *
 * The obtained area is only to be used for dumping purposes.
 */
void sched_get_runqueues_area(void *start, size_t size)
{
        start = &runqueues;
        size = sizeof(runqueues);
}

might be cleaner.

Having said that, if you realize that there is a fundamental issue withwhat I propose, please speak up.

So far, I feel like there are only limited number of "suboptimal" casesof this kind, but I might be wrong of course.


--
Cheers

David / dhildenb

Re: [RFC][PATCH v2 22/29] mm/numa: Register information into Kmemdump

Reply via email to