On Thu, Nov 20, 2025 at 9:05 AM HAGIO KAZUHITO(萩尾 一仁) <[email protected]> wrote:
> On 2025/11/19 16:39, lijiang wrote:
> > On Wed, Nov 19, 2025 at 12:50 PM HAGIO KAZUHITO(萩尾 一仁)
> > <[email protected]> wrote:
> >
> > > On 2025/11/18 17:55, Lianbo Jiang wrote:
> > > > The "runq -g" option may fail on some vmcores from customers, and
> > > > report the following error:
> > > >
> > > > crash> runq -g
> > > > ...
> > > > malloc_bp[1998]: 11592c20
> > > > malloc_bp[1999]: 11662490
> > > > ...
> > > > average size: 11922
> > > > runq: cannot allocate any more memory!
> > > >
> > > > This is because the maximum number of malloc() calls made through
> > > > GETBUF() was reached, which is currently limited to
> > > > MAX_MALLOC_BUFS (2000). Furthermore, the error message is not
> > > > very clear.
> > > >
> > > > Given that, let's expand the MAX_MALLOC_BUFS limitation and make
> > > > the error message clear and concise.
> > >
> > > Hi Lianbo,
> > >
> > > out of curiosity, does this mean that the cause is clear and there
> > > is no other way to fix the issue?  IOW, is there no buffer leak,
> > > wasteful GETBUF or etc?
> > > I'm sorry if you have already investigated them.
> >
> > Good questions, Kazu.
> > So far I haven't found a better way to fix it; the malloc_bp slots
> > are exhausted when running "runq -g", and I did not see a buffer leak
> > (malloc_bp) on this specific code path (if anybody finds one, please
> > let me know).
> >
> > > Generally, relaxing a limitation is the last resort, I think,
> > > because limitations are a kind of safety mechanism.  Also, relaxing
> > > the limitation may be a stopgap solution for this vmcore.  If you
> >
> > Agree with you.
> >
> > > get another vmcore hitting this again, do you relax it again?
> >
> > That needs to be considered according to the actual situation. In the
> > current case, if the limitation is not expanded, we probably have to
> > tell customers that "runq -g" cannot work because of the
> > MAX_MALLOC_BUFS (2000) limit.
> >
> > BTW: for some large-scale servers equipped with many cores (even
> > hundreds of CPUs), running thousands of tasks and making use of task
> > groups, the current maximum of 2000 is really too small, so it could
> > be good to increase it appropriately.
>
> Thank you for the reply, Lianbo.
>
> Sure, if there is no better way, we need to expand the limitation.
> My question was, if so, what does the number of GETBUFs grow in
> proportion to in the "runq -g" option?

I did not make an accurate count of that, but roughly it should be related
to the number of runqueues and *tasks* in the task group.

> Also, it looks like the "runq -g" has recursive calls, I thought that

You are right, Kazu. There are several recursive calls in
dump_tasks_by_task_group().

> there might be GETBUFs that can be reduced.
>
> I'm not sure which GETBUF causes the issue and this is just an example,
> I found a buf which goes into a recursive call.  If recursive calls with
> the buf cause the issue, maybe we can reduce them.

Thanks for sharing your thoughts. I will look at it later.
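
To make the failure mode easier to picture, here is a minimal, self-contained
sketch. It is not crash source; toy_getbuf(), toy_freebuf(), TOY_MAX_BUFS and
walk_group() are made-up names, only meant to model a fixed-slot buffer pool
in the spirit of getbuf()'s malloc_bp[] array, driven by a recursive walk that
holds one buffer per level until it unwinds:

/* Toy model (not crash source): a fixed-slot buffer pool plus a
 * recursive walk that keeps one buffer allocated per level of the
 * tree until that level returns. */
#include <stdio.h>
#include <stdlib.h>

#define TOY_MAX_BUFS 8          /* stand-in for MAX_MALLOC_BUFS */

static void *toy_bufs[TOY_MAX_BUFS];
static int toy_in_use, toy_peak;

static void *toy_getbuf(size_t size)
{
        for (int i = 0; i < TOY_MAX_BUFS; i++) {
                if (!toy_bufs[i]) {
                        toy_bufs[i] = malloc(size);
                        if (++toy_in_use > toy_peak)
                                toy_peak = toy_in_use;
                        return toy_bufs[i];
                }
        }
        /* all slots taken: the same situation as the runq -g failure */
        fprintf(stderr, "cannot allocate any more memory!\n");
        exit(1);
}

static void toy_freebuf(void *buf)
{
        for (int i = 0; i < TOY_MAX_BUFS; i++) {
                if (toy_bufs[i] == buf) {
                        free(buf);
                        toy_bufs[i] = NULL;
                        toy_in_use--;
                        return;
                }
        }
}

/* Each level of the walk holds its buffer while it recurses into its
 * children, so buffers accumulate along the current path of the tree. */
static void walk_group(int depth, int children)
{
        void *buf = toy_getbuf(128);

        if (depth > 0)
                for (int i = 0; i < children; i++)
                        walk_group(depth - 1, children);

        toy_freebuf(buf);
}

int main(void)
{
        walk_group(3, 2);       /* needs depth+1 slots at once */
        printf("peak buffers in use: %d\n", toy_peak);
        return 0;
}

In this toy model the peak number of slots in use grows with the depth of the
walk; if buffers were only released after the whole traversal, it would grow
with the total number of nodes instead. Either kind of growth could be what
pushes a large task-group hierarchy toward the MAX_MALLOC_BUFS cap.
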
> (but this may have a trade-off between memory and speed, we need to
> check whether we can accept it, though.)
>
> --- a/task.c
> +++ b/task.c
> @@ -10086,9 +10086,6 @@ dump_tasks_in_task_group_rt_rq(int depth, ulong rt_rq, int cpu)
>          char *rt_rq_buf, *u_prio_array;
>
>          k_prio_array = rt_rq + OFFSET(rt_rq_active);
> -        rt_rq_buf = GETBUF(SIZE(rt_rq));
> -        readmem(rt_rq, KVADDR, rt_rq_buf, SIZE(rt_rq), "rt_rq", FAULT_ON_ERROR);
> -        u_prio_array = &rt_rq_buf[OFFSET(rt_rq_active)];
>
>          if (depth) {
>                  readmem(rt_rq + OFFSET(rt_rq_tg), KVADDR,
> @@ -10111,8 +10108,8 @@ dump_tasks_in_task_group_rt_rq(int depth, ulong rt_rq, int cpu)
>          for (i = tot = 0; i < qheads; i++) {
>                  offset = OFFSET(rt_prio_array_queue) + (i * SIZE(list_head));
>                  kvaddr = k_prio_array + offset;
> -                uvaddr = (ulong)u_prio_array + offset;
> -                BCOPY((char *)uvaddr, (char *)&list_head[0], sizeof(ulong)*2);
> +                readmem(rt_rq + OFFSET(rt_rq_active) + offset, KVADDR, &list_head,
> +                        sizeof(ulong)*2, "rt_prio_array queue[]", FAULT_ON_ERROR);
>
>                  if (CRASHDEBUG(1))
>                          fprintf(fp, "rt_prio_array[%d] @ %lx => %lx/%lx\n",
> @@ -10169,7 +10166,6 @@ is_task:
>                  INDENT(5 + 6 * depth);
>                  fprintf(fp, "[no tasks queued]\n");
>          }
> -        FREEBUF(rt_rq_buf);
>  }
>
>  static char *
>
> Like this, if the number of GETBUFs grows depending on some data/code
> structures, there might be a way to avoid it by code work.
>
> The crash-utility handles various vmcores, and a vmcore may have a
> broken or unexpected structure.  The limitation can avoid a lot of
> malloc calls for such unexpected data, so if a lot of GETBUFs are
> required, we should check whether the code is reasonable enough first,
> imho.
> But yes, if it's hard to change the code, it's good to change the
> limitation.
>
> Thanks,
> Kazu
>
> > Thanks
> > Lianbo
> >
> > > Thanks,
> > > Kazu
> > >
> > > > With the patch:
> > > > crash> runq -g
> > > > ...
> > > >    CPU 95
> > > >      CURRENT: PID: 64281  TASK: ffff9f541b064000  COMMAND: "xxx_64281_sv"
> > > >      ROOT_TASK_GROUP: ffffffffa64ff940  RT_RQ: ffff9f86bfdf3a80
> > > >         [no tasks queued]
> > > >      ROOT_TASK_GROUP: ffffffffa64ff940  CFS_RQ: ffff9f86bfdf38c0
> > > >         [120] PID: 64281  TASK: ffff9f541b064000  COMMAND: "xxx_64281_sv" [CURRENT]
> > > >         TASK_GROUP: ffff9f47cb3b9180  CFS_RQ: ffff9f67c0417a00  <user.slice>
> > > >            [120] PID: 65275  TASK: ffff9f6820208000  COMMAND: "server"
> > > >            TASK_GROUP: ffff9f67f9ac2300  CFS_RQ: ffff9f6803662000  <oratfagroup>
> > > >               [120] PID: 1209636  TASK: ffff9f582f25c000  COMMAND: "crsctl"
> > > >
> > > > Reported-by: Buland Kumar Singh <[email protected]>
> > > > Signed-off-by: Lianbo Jiang <[email protected]>
> > > > ---
> > > >  tools.c | 4 ++--
> > > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/tools.c b/tools.c
> > > > index a9ad18d520d9..6676881c182a 100644
> > > > --- a/tools.c
> > > > +++ b/tools.c
> > > > @@ -5698,7 +5698,7 @@ ll_power(long long base, long long exp)
> > > >  #define B32K   (4)
> > > >
> > > >  #define SHARED_BUF_SIZES  (B32K+1)
> > > > -#define MAX_MALLOC_BUFS   (2000)
> > > > +#define MAX_MALLOC_BUFS   (3072)
> > > >  #define MAX_CACHE_SIZE    (KILOBYTES(32))
> > > >
> > > >  struct shared_bufs {
> > > > @@ -6130,7 +6130,7 @@ getbuf(long reqsize)
> > > >          dump_shared_bufs();
> > > >
> > > >          return ((char *)(long)
> > > > -            error(FATAL, "cannot allocate any more memory!\n"));
> > > > +            error(FATAL, "cannot allocate any more memory, reached the max number of malloc() calls via GETBUF()!\n"));
> > > >  }
> > > >
> > > >  /*
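
As a footnote on the memory-vs-speed trade-off mentioned for the task.c
change Kazu posted above: the idea there is to drop the GETBUF of the whole
rt_rq and instead readmem() only the two list_head pointers needed per queue
entry. The sketch below models that trade-off in isolation; it is not crash
code, and fake_readmem(), IMAGE_SIZE, NR_QUEUES and QUEUE_OFFSET are invented
for illustration only:

/* Toy model of the memory-vs-speed trade-off discussed above.
 * "Kernel memory" is just a byte array here; fake_readmem() copies
 * from it the way a dump reader would. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define IMAGE_SIZE      4096
#define NR_QUEUES       8
#define QUEUE_OFFSET    256     /* made-up offset of the queue array */

static unsigned char image[IMAGE_SIZE];         /* stand-in for the dump */

static void fake_readmem(unsigned long addr, void *buf, size_t size)
{
        memcpy(buf, &image[addr], size);
}

/* (a) buffer the whole record once, then pick fields out of the copy:
 *     one large allocation held across the loop, but only one read. */
static void scan_buffered(void)
{
        unsigned char *copy = malloc(IMAGE_SIZE);
        unsigned long heads[2];

        fake_readmem(0, copy, IMAGE_SIZE);
        for (int i = 0; i < NR_QUEUES; i++) {
                memcpy(heads, copy + QUEUE_OFFSET + i * sizeof(heads),
                       sizeof(heads));
                printf("(a) queue[%d]: %lx/%lx\n", i, heads[0], heads[1]);
        }
        free(copy);
}

/* (b) read only the two pointers needed per iteration: no buffer is
 *     held, but the loop does NR_QUEUES small reads instead of one. */
static void scan_on_demand(void)
{
        unsigned long heads[2];

        for (int i = 0; i < NR_QUEUES; i++) {
                fake_readmem(QUEUE_OFFSET + i * sizeof(heads), heads,
                             sizeof(heads));
                printf("(b) queue[%d]: %lx/%lx\n", i, heads[0], heads[1]);
        }
}

int main(void)
{
        scan_buffered();
        scan_on_demand();
        return 0;
}

Approach (a) keeps one heap buffer alive across the loop but performs a
single read; approach (b) keeps no buffer but performs one small read per
iteration. Whether the extra reads are acceptable is the speed question that
would need to be checked against real vmcores.
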
--
Crash-utility mailing list -- [email protected]
To unsubscribe send an email to [email protected]
https://${domain_name}/admin/lists/devel.lists.crash-utility.osci.io/
Contribution Guidelines: https://github.com/crash-utility/crash/wiki
