On Thu, Nov 20, 2025 at 9:05 AM HAGIO KAZUHITO(萩尾 一仁) <[email protected]>
wrote:

> On 2025/11/19 16:39, lijiang wrote:
> > On Wed, Nov 19, 2025 at 12:50 PM HAGIO KAZUHITO(萩尾 一仁) <[email protected]> wrote:
> >
> >     On 2025/11/18 17:55, Lianbo Jiang wrote:
> >      > The "runq -g" option may fail on some vmcores from customers, and
> >      > report the following error:
> >      >
> >      >    crash> runq -g
> >      >    ...
> >      >        malloc_bp[1998]: 11592c20
> >      >        malloc_bp[1999]: 11662490
> >      >    ...
> >      >        average size: 11922
> >      >      runq: cannot allocate any more memory!
> >      >
> >      > This is because the maximum number of malloc() buffers obtained
> >      > through GETBUF() was reached; it is currently limited to
> >      > MAX_MALLOC_BUFS (2000). Furthermore, the error message is not
> >      > very clear.
> >      >
> >      > Given that, let's expand the MAX_MALLOC_BUFS limit and make the
> >      > error message clear and concise.
> >
> >     Hi Lianbo,
> >
> >     out of curiosity, does this mean that the cause is clear and there
> >     is no other way to fix the issue?  IOW, is there no buffer leak,
> >     wasteful GETBUF or etc?
> >     I'm sorry if you have already investigated them.
> >
> >
> > Good questions, Kazu.
> > So far I haven't found a better way to fix it: malloc_bp[] gets
> > exhausted when running "runq -g", and I did not see a buffer leak
> > (malloc_bp) on this specific code path (if anybody finds one, please
> > let me know).
> >
> >
> >     Generally, relaxing a limitation is the last resort, I think,
> >     because limitations are a kind of safety mechanism.  Also, relaxing
> >     the limitation may be a stopgap solution for the vmcore.  If you
> >
> >
> > Agree with you.
> >
> >     get another vmcore hitting this again, do you relax it again?
> >
> > That needs to be considered according to the actual situation. For the
> > current case, if the limitation is not expanded, we probably have to
> > tell customers that "runq -g" cannot work because of the
> > MAX_MALLOC_BUFS(2000) limit.
> >
> > BTW: for some large-scale servers with many cores (even hundreds of
> > CPUs), running thousands of tasks and utilizing task groups, the max
> > value of 2000 is really too small; therefore it would be good to
> > increase it appropriately.
> >
>
> Thank you for the reply, Lianbo.
>
> Sure, if there is no better way, we need to expand the limitation.
> My question was, if so, what does the number of GETBUFs grow in
> proportion to in the "runq -g" option?
>
>
I did not make an accurate count of that, but roughly it should be
proportional to the number of runqueues and *tasks* in the task groups.
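
For readers who are not familiar with the buffer layer being discussed,
here is a minimal sketch of the bookkeeping behind the error (simplified;
it is not the actual tools.c code, which first tries fixed-size shared
buffer pools before falling back to malloc()):

#include <stdlib.h>

/*
 * Simplified sketch of the GETBUF()/FREEBUF() malloc fallback.
 * malloc_bp[] tracks buffers that are currently handed out, so the
 * MAX_MALLOC_BUFS limit is on buffers held at the same time, not on
 * the total number of allocations a command makes.
 */
#define MAX_MALLOC_BUFS (2000)

static void *malloc_bp[MAX_MALLOC_BUFS];

static void *
getbuf_sketch(size_t reqsize)
{
	int i;

	for (i = 0; i < MAX_MALLOC_BUFS; i++) {
		if (malloc_bp[i] == NULL) {
			malloc_bp[i] = malloc(reqsize);
			return malloc_bp[i];
		}
	}

	/* all slots busy: this is the point where the real getbuf()
	 * prints "cannot allocate any more memory!" */
	return NULL;
}

static void
freebuf_sketch(void *bp)
{
	int i;

	for (i = 0; i < MAX_MALLOC_BUFS; i++) {
		if (malloc_bp[i] == bp) {
			free(bp);
			malloc_bp[i] = NULL;
			return;
		}
	}
}

If that sketch matches the real behavior, "runq -g" only hits the limit
when it is holding roughly 2000 buffers at the same moment, which would
explain the failure without any actual leak.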


> Also, it looks like the "runq -g" has recursive calls, I thought that
> there might be GETBUFs that can be reduced.

You are right, Kazu. There are several recursive calls in
dump_tasks_by_task_group().

> I'm not sure which GETBUF causes the issue and this is just an example,
> I found a buf which goes into a recursive call.  If recursive calls with
> the buf cause the issue, maybe we can reduce them.

Thanks for sharing your thoughts. I will look into it later.
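
To make the concern concrete, here is a hypothetical illustration of the
pattern Kazu points at in the hunk quoted below; it is not the real
dump_tasks_in_task_group_*() code, and walk_group(), struct group and
print_rq() are invented for this sketch (it assumes the usual GETBUF,
FREEBUF and SIZE macros from defs.h):

/*
 * A buffer obtained before descending into child task groups stays in
 * malloc_bp[] until the walk unwinds, so the number of live GETBUF
 * buffers grows with the nesting depth of the task-group hierarchy
 * (plus anything else allocated beneath it) instead of staying constant.
 */
struct group {
	struct group *children;		/* first child task group */
	struct group *next;		/* next sibling */
	unsigned long rq;		/* runqueue address in the dump */
};

static void
walk_group(struct group *g, int depth)
{
	char *rq_buf = GETBUF(SIZE(cfs_rq));	/* slot taken here ...     */
	struct group *child;

	print_rq(g->rq, rq_buf, depth);		/* hypothetical printer    */

	for (child = g->children; child; child = child->next)
		walk_group(child, depth + 1);	/* ... and still held here */

	FREEBUF(rq_buf);			/* released only on unwind */
}

Kazu's hunk below avoids holding such a buffer by dropping the
whole-struct GETBUF and reading just the two list_head pointers with
readmem() inside the loop, at the cost of extra readmem() calls, which is
the memory/speed trade-off mentioned next.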


> (but this may have a trade-off between memory and speed; we need to
> check whether we can accept it, though.)
>
> --- a/task.c
> +++ b/task.c
> @@ -10086,9 +10086,6 @@ dump_tasks_in_task_group_rt_rq(int depth, ulong rt_rq, int cpu)
>          char *rt_rq_buf, *u_prio_array;
>
>          k_prio_array = rt_rq +  OFFSET(rt_rq_active);
> -       rt_rq_buf = GETBUF(SIZE(rt_rq));
> -       readmem(rt_rq, KVADDR, rt_rq_buf, SIZE(rt_rq), "rt_rq", FAULT_ON_ERROR);
> -       u_prio_array = &rt_rq_buf[OFFSET(rt_rq_active)];
>
>          if (depth) {
>                  readmem(rt_rq + OFFSET(rt_rq_tg), KVADDR,
> @@ -10111,8 +10108,8 @@ dump_tasks_in_task_group_rt_rq(int depth, ulong rt_rq, int cpu)
>          for (i = tot = 0; i < qheads; i++) {
>                  offset =  OFFSET(rt_prio_array_queue) + (i * SIZE(list_head));
>                  kvaddr = k_prio_array + offset;
> -               uvaddr = (ulong)u_prio_array + offset;
> -               BCOPY((char *)uvaddr, (char *)&list_head[0], sizeof(ulong)*2);
> +               readmem(rt_rq + OFFSET(rt_rq_active) + offset, KVADDR, &list_head,
> +                       sizeof(ulong)*2, "rt_prio_array queue[]", FAULT_ON_ERROR);
>
>                  if (CRASHDEBUG(1))
>                          fprintf(fp, "rt_prio_array[%d] @ %lx => %lx/%lx\n",
> @@ -10169,7 +10166,6 @@ is_task:
>                  INDENT(5 + 6 * depth);
>                  fprintf(fp, "[no tasks queued]\n");
>          }
> -       FREEBUF(rt_rq_buf);
>   }
>
>   static char *
>
>
> Like this, if the number of GETBUFs grows depending on some data/code
> structures, there might be a way to avoid it by code work.
>
> The crash-utility handles various vmcores, which may have broken or
> unexpected structures.  The limitation can avoid a lot of malloc calls
> for such unexpected data.  So if a lot of GETBUFs are required, we
> should check whether the code is reasonable enough first, imho.
> But yes, if it's hard to change the code, it's good to change the
> limitation.
>
> Thanks,
> Kazu
>
> >
> > Thanks
> > Lianbo
> >
> >
> >     Thanks,
> >     Kazu
> >
> >      >
> >      > With the patch:
> >      >    crash> runq -g
> >      >    ...
> >      >    CPU 95
> >      >      CURRENT: PID: 64281  TASK: ffff9f541b064000  COMMAND: "xxx_64281_sv"
> >      >      ROOT_TASK_GROUP: ffffffffa64ff940  RT_RQ: ffff9f86bfdf3a80
> >      >         [no tasks queued]
> >      >      ROOT_TASK_GROUP: ffffffffa64ff940  CFS_RQ: ffff9f86bfdf38c0
> >      >         [120] PID: 64281  TASK: ffff9f541b064000  COMMAND: "xxx_64281_sv" [CURRENT]
> >      >         TASK_GROUP: ffff9f47cb3b9180  CFS_RQ: ffff9f67c0417a00 <user.slice>
> >      >            [120] PID: 65275  TASK: ffff9f6820208000  COMMAND: "server"
> >      >         TASK_GROUP: ffff9f67f9ac2300  CFS_RQ: ffff9f6803662000 <oratfagroup>
> >      >            [120] PID: 1209636  TASK: ffff9f582f25c000  COMMAND: "crsctl"
> >      >
> >      > Reported-by: Buland Kumar Singh <[email protected]>
> >      > Signed-off-by: Lianbo Jiang <[email protected]>
> >      > ---
> >      >   tools.c | 4 ++--
> >      >   1 file changed, 2 insertions(+), 2 deletions(-)
> >      >
> >      > diff --git a/tools.c b/tools.c
> >      > index a9ad18d520d9..6676881c182a 100644
> >      > --- a/tools.c
> >      > +++ b/tools.c
> >      > @@ -5698,7 +5698,7 @@ ll_power(long long base, long long exp)
> >      >   #define B32K (4)
> >      >
> >      >   #define SHARED_BUF_SIZES  (B32K+1)
> >      > -#define MAX_MALLOC_BUFS   (2000)
> >      > +#define MAX_MALLOC_BUFS   (3072)
> >      >   #define MAX_CACHE_SIZE    (KILOBYTES(32))
> >      >
> >      >   struct shared_bufs {
> >      > @@ -6130,7 +6130,7 @@ getbuf(long reqsize)
> >      >       dump_shared_bufs();
> >      >
> >      >       return ((char *)(long)
> >      > -             error(FATAL, "cannot allocate any more memory!\n"));
> >      > +             error(FATAL, "cannot allocate any more memory, reached to max numbers of malloc() via GETBUF()!\n"));
> >      >   }
> >      >
> >      >   /*
> >