On Tue, Nov 26, 2024 at 12:34 PM Tao Liu <l...@redhat.com> wrote:

> Hi lianbo & guanyou,
>
> On Thu, Nov 21, 2024 at 8:56 PM Guanyou Chen <chenguanyou9...@gmail.com>
> wrote:
> >
> > Hi  Tao
> >
> > > 1. What is the root cause, why cpu offline state will lead to mapping
> > > errors? You didn't make this clear in your commit message.
> >
> > The back CPU context  wrong order When the front CPU is offline.
> > exp:
> > CPU0, CPU1 is offline,  CPU2 is online,  CPU2's nt_prstatus_percpu from
> corefile note prstatus[0], because  i, j not equal.
> >
> > > 2. Is this issue cpu arch specific? Since the example you gave is only
> > > arm64 specific, however the code change is generic for all archs. So
> > > I'm a bit worried about whether it will make the result incorrect for
> > > other archs. In addition, I didn't reproduce the mapping error issue
> > > on my x86_64 machine, so I cannot verify the patch locally.
> >
> > N/A
> >
> > > What command line you trigger will get the following outputs?
> > $ crash vmlinux vmcore -d 1
> > crash> help -r
> >
> > Thanks,
> > Guanyou
> >
> > Tao Liu <l...@redhat.com> 于2024年11月21日周四 13:42写道:
> >>
> >> Hi Guanyou,
> >>
> >> Thanks for the fix. However I have a few questions for this patch:
> >>
> >> On Fri, Nov 1, 2024 at 10:16 PM Guanyou Chen <chenguanyou9...@gmail.com>
> wrote:
> >> >
> >> > Hi  Lianbo, Tao
> >> >
> >> > When CPUs are in an offline state, it can lead to mapping errors.
> >>
> >> 1. What is the root cause, why cpu offline state will lead to mapping
> >> errors? You didn't make this clear in your commit message.
> >> 2. Is this issue cpu arch specific? Since the example you gave is only
> >> arm64 specific, however the code change is generic for all archs. So
> >> I'm a bit worried about whether it will make the result incorrect for
> >> other archs. In addition, I didn't reproduce the mapping error issue
> >> on my x86_64 machine, so I cannot verify the patch locally.
> >>
> >> > We need to map them to the correct positions one by one.
> >> >
> >>
> >> 3. Please make the commit log tidy. I know you want to diff the
> >> outputs of "before" and "after", but we only need the info which are
> >> highly related to this issue. e.g:
> >>
> >> > Before:
> >>
> >> What command line you trigger will get the following outputs?
> >>
> >> > n_namesz: 5 ("CPU2")
> >> > n_descsz: 392
> >> >   n_type: 1 (NT_PRSTATUS)
> >> >           si.signo: 0  si.code: 0  si.errno: 0
> >> >           cursig: 0  sigpend: 0  sighold: 0
> >> >           pid: 3  ppid: 0  pgrp: 0  sid:0
> >> >           utime: 0.000000  stime: 0.000000
> >> >           cutime: 0.000000  cstime: 0.000000
> >>
> >> I'm not sure if the above info is related to this issue?
> >>
> >> >            X0: ffffffc000fc8818   X1: 0000000000000000   X2:
> ffffffc000fc84c8
> >> >            X3: 0000000000000000   X4: ffffffc0405e37bf   X5:
> ffffffc00a07372f
> >> >            X6: 322e34323320205b   X7: 545b5d3539383334   X8:
> ffffffc000fc2f0c
> >> >            X9: 89fece0a9ef8cb00  X10: c0000001001f75f4  X11:
> 00000001001f75f4
> >> >           X12: 0000000000000003  X13: 00000000000005f4  X14:
> ffffffc009eb1210
> >> >           X15: 0000000000000004  X16: 000000002a4cec24  X17:
> 000000002a4cec24
> >> >           X18: ffffffc009e7d140  X19: ffffffc00a04c670  X20:
> 0000000000000000
> >> >           X21: 0000000000000000  X22: ffffff8027f22280  X23:
> 0000000000000009
> >> >           X24: 0000000000000007  X25: ffffffc009f839c0  X26:
> ffffffc0090f87f8
> >> >           X27: 0000000000000000  X28: ffffff80454f3840  X29:
> ffffffc0405e3b60
> >> >            LR: ffffffc0080e57fc   SP: ffffffc0405e3b60   PC:
> ffffffc000fc2f84
> >>
> >> We don't need a list of all regs, a few lines of which can indicate
> >> the output mismatch for "Before" and "After" is enough.
> >>
> >> >
> >> > CPU 0: [OFFLINE]
> >> > CPU 1: [OFFLINE]
> >> > CPU 2:
> >> >     X0: 0000000000000000   X1: 0000000000000000   X2: 0000000000000000
> >> >     X3: 000000000003fcbc   X4: 0000000000000001   X5: 0000000000000000
> >> >     X6: 0000000000000000   X7: 0000000000000000   X8: 00000000ffffffff
> >> >     X9: ffffffc009e6ae48  X10: ffffffc009e6ae20  X11: 0000000000000000
> >> >    X12: 0000000000000002  X13: 0000000000000004  X14: 0000000000000000
> >> >    X15: 0000000000004000  X16: 00000000f90f05f6  X17: 00000000f90f05f6
> >> >    X18: 0000000000000000  X19: 0000000000000002  X20: ffffffc009e3b008
> >> >    X21: ffffffc00a01d020  X22: ffffffc009f798f0  X23: 0000000060001000
> >> >    X24: 0000000000000000  X25: 0000000000000000  X26: 0000000000000000
> >> >    X27: 0000000000000000  X28: ffffff8111eecb00  X29: ffffffc008003f50
> >> >     LR: ffffffc00802df88   SP: ffffffc008003f40   PC: ffffffc00802df94
> >> >    PSTATE: 024003c5   FPVALID: 00000000
> >> >
> >> > After:
> >> > CPU 2:
> >> >     X0: ffffffc000fc8818   X1: 0000000000000000   X2: ffffffc000fc84c8
> >> >     X3: 0000000000000000   X4: ffffffc0405e37bf   X5: ffffffc00a07372f
> >> >     X6: 322e34323320205b   X7: 545b5d3539383334   X8: ffffffc000fc2f0c
> >> >     X9: 89fece0a9ef8cb00  X10: c0000001001f75f4  X11: 00000001001f75f4
> >> >    X12: 0000000000000003  X13: 00000000000005f4  X14: ffffffc009eb1210
> >> >    X15: 0000000000000004  X16: 000000002a4cec24  X17: 000000002a4cec24
> >> >    X18: ffffffc009e7d140  X19: ffffffc00a04c670  X20: 0000000000000000
> >> >    X21: 0000000000000000  X22: ffffff8027f22280  X23: 0000000000000009
> >> >    X24: 0000000000000007  X25: ffffffc009f839c0  X26: ffffffc0090f87f8
> >> >    X27: 0000000000000000  X28: ffffff80454f3840  X29: ffffffc0405e3b60
> >>
> >> >     LR: ffffffc0080e57fc   SP: ffffffc0405e3b60   PC: ffffffc000fc2f84
> >> >    PSTATE: 600000c5   FPVALID: 00000000
> >> >
> >> > crash> bt
> >> > PID: 15959    TASK: ffffff80454f3840  CPU: 2    COMMAND: "AnrConsumer"
> >> >     [ffffffc0405e3b60] ipanic at ffffffc000fc2f80 [mrdump]
> >> >     [ffffffc0405e3b70] atomic_notifier_call_chain at ffffffc0080e57f8
> >> >     [ffffffc0405e3c30] panic at ffffffc008f734d0
> >> >     [ffffffc0405e3c80] sysrq_handle_crash at ffffffc0087f3c18
> >> >     [ffffffc0405e3c90] __handle_sysrq at ffffffc0087f3798
> >> >     [ffffffc0405e3ce0] write_sysrq_trigger at ffffffc0087f49c0
> >> >     [ffffffc0405e3d00] proc_reg_write at ffffffc00842e4b8
> >> >     [ffffffc0405e3d80] vfs_write at ffffffc008381eb4
> >> >     [ffffffc0405e3dd0] ksys_write at ffffffc008382200
> >> >     [ffffffc0405e3e10] __arm64_sys_write at ffffffc00838228c
> >> >     [ffffffc0405e3e20] invoke_syscall at ffffffc00802efe0
> >> >     [ffffffc0405e3e40] el0_svc_common at ffffffc00802eef4
> >> >     [ffffffc0405e3e70] do_el0_svc at ffffffc00802ede8
> >> >     [ffffffc0405e3e80] el0_svc at ffffffc008f7a7d0
> >> >     [ffffffc0405e3ea0] el0t_64_sync_handler at ffffffc008f7a758
> >> >     [ffffffc0405e3fe0] el0t_64_sync at ffffffc00801157c
> >>
> >> The stacktrace is meaningless for this patch, we can cut this off.
> >>
> >> >      PC: 00000077c798ca28   LR: 00000077a82e19f4   SP:
> 000000761c517af0
> >> >     X29: 000000761c517b00  X28: 000000761c517db8  X27:
> 000000761c517c90
> >> >     X26: 000000761c517c98  X25: 000000761c517bf9  X24:
> 000000761c519000
> >> >     X23: 000000761c517be1  X22: 0000000000000001  X21:
> 00000000000003e3
> >> >     X20: 000000761c517c11  X19: 000000761c517bf8  X18:
> 0000007568224000
> >> >     X17: 00000077c798ca20  X16: 00000077c79b2ae0  X15:
> b4000077202cc480
> >> >     X14: 0000000000000000  X13: 000000761c517a70  X12:
> ffffff80ffffffd0
> >> >     X11: 000000761c517a40  X10: 0000000000000001   X9:
> 0000000000000000
> >> >      X8: 0000000000000040   X7: 7f7f7f7f7f7f7f7f   X6:
> 0000000000000010
> >> >      X5: 000000761c517c0c   X4: ffffffffffffffff   X3:
> ffffffffffffffff
> >> >      X2: 0000000000000001   X1: 000000761c517c11   X0:
> 00000000000003e3
> >> >     ORIG_X0: 00000000000003e3  SYSCALLNO: 40  PSTATE: 00001000
> >> >
> >> ditto.
> >>
> >> Maybe a short summary after the output paste, telling us which info
> >> should reviewers pay attention to.
> >>
> >> Thanks,
> >> Tao Liu
> >>
> >> > Signed-off-by: Guanyou.Chen <chenguan...@xiaomi.com>
> >> > ---
> >> >  netdump.c | 6 +++---
> >> >  1 file changed, 3 insertions(+), 3 deletions(-)
> >> >
> >> > diff --git a/netdump.c b/netdump.c
> >> > index b4e2a5c..8ea5159 100644
> >> > --- a/netdump.c
> >> > +++ b/netdump.c
> >> > @@ -75,7 +75,7 @@ void
> >> >  map_cpus_to_prstatus(void)
> >> >  {
> >> >     void **nt_ptr;
> >> > -   int online, i, j, nrcpus;
> >> > +   int online, i, nrcpus;
> >> >     size_t size;
> >> >
> >> >     if (pc->flags2 & QEMU_MEM_DUMP_ELF)  /* notes exist for all cpus
> */
> >> > @@ -100,9 +100,9 @@ map_cpus_to_prstatus(void)
> >> >      */
> >> >     nrcpus = (kt->kernel_NR_CPUS ? kt->kernel_NR_CPUS : NR_CPUS);
> >> >
> >> > -   for (i = 0, j = 0; i < nrcpus; i++) {
> >> > +   for (i = 0; i < nrcpus; i++) {
> >> >         if (in_cpu_map(ONLINE_MAP, i) &&
> machdep->is_cpu_prstatus_valid(i)) {
> >> > -           nd->nt_prstatus_percpu[i] = nt_ptr[j++];
> >> > +           nd->nt_prstatus_percpu[i] = nt_ptr[i];
>
> I re-think about the code change, it seems to me the variable 'j'
> doesn't make any sense here.
>
> nt_ptr = (void **)GETBUF(size);
> BCOPY(nd->nt_prstatus_percpu, nt_ptr, size);
> BZERO(nd->nt_prstatus_percpu, size);
>
> The data of nt_ptr buffer is copied from nd->nt_prstatus_percpu
> buffer, then nd->nt_prstatus_percpu buffer is cleared. If cpu 0 1 2 4
> is online and 3 is offline, nd->nt_prstatus_percpu[4] = nt_ptr[3];
> that is, copy the original cpu3's info into current cpu4's buf. This
> cpu shift operation doesn't make any sense to me,
> nd->nt_prstatus_percpu[4] = nt_ptr[4] should make sense in this case.
> By searching the code, I didn't see nd->nt_prstatus_percpu[cpu] except
> for any cpu shifts.
>
> However I cannot find out the git history of why variable 'j' was
> introduced. So this patch looks correct to me, though I haven't run
> the overall regression test against it. What's your opinion on the
> variable 'j'? In addition, I can do a regression test if needed.
> @Lianbo Jiang
>
>
The variable 'j' was introduced into crash utility with the related ppc64
patch changes together, however this changed the corresponding relationship
between the cpu and nt_prstatus_percpu[], in addition, the
num_prstatus_notes was also changed(only include online cpus' notes).

For ppc64, it has a fadump case, which does not depend on the crash_notes.

I guess we can not remove it simply, and need to consider it more:
[1] why did the num_prstatus_notes need to be changed?
[2] why did we copy the values of nt_prstatus_percpu[] to another array
nt_ptr instead of changing the values on it directly? Can the variable
'nt_ptr' be dropped?
[3] how to handle the code related to this?

There is another similar code in diskdump.c, which seems need to be
> updated as well.
>

It's true. I also saw the similar code in ppc.

Thanks
Lianbo


> Thanks,
> Tao Liu
>
>
> >> >             nd->num_prstatus_notes =
> >> >                 MAX(nd->num_prstatus_notes, i+1);
> >> >         }
> >> > --
> >> > 2.34.1
> >> >
> >> > Guanyou.
> >> > Thanks.
> >>
>
>
--
Crash-utility mailing list -- devel@lists.crash-utility.osci.io
To unsubscribe send an email to devel-le...@lists.crash-utility.osci.io
https://${domain_name}/admin/lists/devel.lists.crash-utility.osci.io/
Contribution Guidelines: https://github.com/crash-utility/crash/wiki

Reply via email to