On Tue, Nov 26, 2024 at 12:34 PM Tao Liu <l...@redhat.com> wrote: > Hi lianbo & guanyou, > > On Thu, Nov 21, 2024 at 8:56 PM Guanyou Chen <chenguanyou9...@gmail.com> > wrote: > > > > Hi Tao > > > > > 1. What is the root cause, why cpu offline state will lead to mapping > > > errors? You didn't make this clear in your commit message. > > > > The back CPU context wrong order When the front CPU is offline. > > exp: > > CPU0, CPU1 is offline, CPU2 is online, CPU2's nt_prstatus_percpu from > corefile note prstatus[0], because i, j not equal. > > > > > 2. Is this issue cpu arch specific? Since the example you gave is only > > > arm64 specific, however the code change is generic for all archs. So > > > I'm a bit worried about whether it will make the result incorrect for > > > other archs. In addition, I didn't reproduce the mapping error issue > > > on my x86_64 machine, so I cannot verify the patch locally. > > > > N/A > > > > > What command line you trigger will get the following outputs? > > $ crash vmlinux vmcore -d 1 > > crash> help -r > > > > Thanks, > > Guanyou > > > > Tao Liu <l...@redhat.com> 于2024年11月21日周四 13:42写道: > >> > >> Hi Guanyou, > >> > >> Thanks for the fix. However I have a few questions for this patch: > >> > >> On Fri, Nov 1, 2024 at 10:16 PM Guanyou Chen <chenguanyou9...@gmail.com> > wrote: > >> > > >> > Hi Lianbo, Tao > >> > > >> > When CPUs are in an offline state, it can lead to mapping errors. > >> > >> 1. What is the root cause, why cpu offline state will lead to mapping > >> errors? You didn't make this clear in your commit message. > >> 2. Is this issue cpu arch specific? Since the example you gave is only > >> arm64 specific, however the code change is generic for all archs. So > >> I'm a bit worried about whether it will make the result incorrect for > >> other archs. In addition, I didn't reproduce the mapping error issue > >> on my x86_64 machine, so I cannot verify the patch locally. > >> > >> > We need to map them to the correct positions one by one. > >> > > >> > >> 3. Please make the commit log tidy. I know you want to diff the > >> outputs of "before" and "after", but we only need the info which are > >> highly related to this issue. e.g: > >> > >> > Before: > >> > >> What command line you trigger will get the following outputs? > >> > >> > n_namesz: 5 ("CPU2") > >> > n_descsz: 392 > >> > n_type: 1 (NT_PRSTATUS) > >> > si.signo: 0 si.code: 0 si.errno: 0 > >> > cursig: 0 sigpend: 0 sighold: 0 > >> > pid: 3 ppid: 0 pgrp: 0 sid:0 > >> > utime: 0.000000 stime: 0.000000 > >> > cutime: 0.000000 cstime: 0.000000 > >> > >> I'm not sure if the above info is related to this issue? > >> > >> > X0: ffffffc000fc8818 X1: 0000000000000000 X2: > ffffffc000fc84c8 > >> > X3: 0000000000000000 X4: ffffffc0405e37bf X5: > ffffffc00a07372f > >> > X6: 322e34323320205b X7: 545b5d3539383334 X8: > ffffffc000fc2f0c > >> > X9: 89fece0a9ef8cb00 X10: c0000001001f75f4 X11: > 00000001001f75f4 > >> > X12: 0000000000000003 X13: 00000000000005f4 X14: > ffffffc009eb1210 > >> > X15: 0000000000000004 X16: 000000002a4cec24 X17: > 000000002a4cec24 > >> > X18: ffffffc009e7d140 X19: ffffffc00a04c670 X20: > 0000000000000000 > >> > X21: 0000000000000000 X22: ffffff8027f22280 X23: > 0000000000000009 > >> > X24: 0000000000000007 X25: ffffffc009f839c0 X26: > ffffffc0090f87f8 > >> > X27: 0000000000000000 X28: ffffff80454f3840 X29: > ffffffc0405e3b60 > >> > LR: ffffffc0080e57fc SP: ffffffc0405e3b60 PC: > ffffffc000fc2f84 > >> > >> We don't need a list of all regs, a few lines of which can indicate > >> the output mismatch for "Before" and "After" is enough. > >> > >> > > >> > CPU 0: [OFFLINE] > >> > CPU 1: [OFFLINE] > >> > CPU 2: > >> > X0: 0000000000000000 X1: 0000000000000000 X2: 0000000000000000 > >> > X3: 000000000003fcbc X4: 0000000000000001 X5: 0000000000000000 > >> > X6: 0000000000000000 X7: 0000000000000000 X8: 00000000ffffffff > >> > X9: ffffffc009e6ae48 X10: ffffffc009e6ae20 X11: 0000000000000000 > >> > X12: 0000000000000002 X13: 0000000000000004 X14: 0000000000000000 > >> > X15: 0000000000004000 X16: 00000000f90f05f6 X17: 00000000f90f05f6 > >> > X18: 0000000000000000 X19: 0000000000000002 X20: ffffffc009e3b008 > >> > X21: ffffffc00a01d020 X22: ffffffc009f798f0 X23: 0000000060001000 > >> > X24: 0000000000000000 X25: 0000000000000000 X26: 0000000000000000 > >> > X27: 0000000000000000 X28: ffffff8111eecb00 X29: ffffffc008003f50 > >> > LR: ffffffc00802df88 SP: ffffffc008003f40 PC: ffffffc00802df94 > >> > PSTATE: 024003c5 FPVALID: 00000000 > >> > > >> > After: > >> > CPU 2: > >> > X0: ffffffc000fc8818 X1: 0000000000000000 X2: ffffffc000fc84c8 > >> > X3: 0000000000000000 X4: ffffffc0405e37bf X5: ffffffc00a07372f > >> > X6: 322e34323320205b X7: 545b5d3539383334 X8: ffffffc000fc2f0c > >> > X9: 89fece0a9ef8cb00 X10: c0000001001f75f4 X11: 00000001001f75f4 > >> > X12: 0000000000000003 X13: 00000000000005f4 X14: ffffffc009eb1210 > >> > X15: 0000000000000004 X16: 000000002a4cec24 X17: 000000002a4cec24 > >> > X18: ffffffc009e7d140 X19: ffffffc00a04c670 X20: 0000000000000000 > >> > X21: 0000000000000000 X22: ffffff8027f22280 X23: 0000000000000009 > >> > X24: 0000000000000007 X25: ffffffc009f839c0 X26: ffffffc0090f87f8 > >> > X27: 0000000000000000 X28: ffffff80454f3840 X29: ffffffc0405e3b60 > >> > >> > LR: ffffffc0080e57fc SP: ffffffc0405e3b60 PC: ffffffc000fc2f84 > >> > PSTATE: 600000c5 FPVALID: 00000000 > >> > > >> > crash> bt > >> > PID: 15959 TASK: ffffff80454f3840 CPU: 2 COMMAND: "AnrConsumer" > >> > [ffffffc0405e3b60] ipanic at ffffffc000fc2f80 [mrdump] > >> > [ffffffc0405e3b70] atomic_notifier_call_chain at ffffffc0080e57f8 > >> > [ffffffc0405e3c30] panic at ffffffc008f734d0 > >> > [ffffffc0405e3c80] sysrq_handle_crash at ffffffc0087f3c18 > >> > [ffffffc0405e3c90] __handle_sysrq at ffffffc0087f3798 > >> > [ffffffc0405e3ce0] write_sysrq_trigger at ffffffc0087f49c0 > >> > [ffffffc0405e3d00] proc_reg_write at ffffffc00842e4b8 > >> > [ffffffc0405e3d80] vfs_write at ffffffc008381eb4 > >> > [ffffffc0405e3dd0] ksys_write at ffffffc008382200 > >> > [ffffffc0405e3e10] __arm64_sys_write at ffffffc00838228c > >> > [ffffffc0405e3e20] invoke_syscall at ffffffc00802efe0 > >> > [ffffffc0405e3e40] el0_svc_common at ffffffc00802eef4 > >> > [ffffffc0405e3e70] do_el0_svc at ffffffc00802ede8 > >> > [ffffffc0405e3e80] el0_svc at ffffffc008f7a7d0 > >> > [ffffffc0405e3ea0] el0t_64_sync_handler at ffffffc008f7a758 > >> > [ffffffc0405e3fe0] el0t_64_sync at ffffffc00801157c > >> > >> The stacktrace is meaningless for this patch, we can cut this off. > >> > >> > PC: 00000077c798ca28 LR: 00000077a82e19f4 SP: > 000000761c517af0 > >> > X29: 000000761c517b00 X28: 000000761c517db8 X27: > 000000761c517c90 > >> > X26: 000000761c517c98 X25: 000000761c517bf9 X24: > 000000761c519000 > >> > X23: 000000761c517be1 X22: 0000000000000001 X21: > 00000000000003e3 > >> > X20: 000000761c517c11 X19: 000000761c517bf8 X18: > 0000007568224000 > >> > X17: 00000077c798ca20 X16: 00000077c79b2ae0 X15: > b4000077202cc480 > >> > X14: 0000000000000000 X13: 000000761c517a70 X12: > ffffff80ffffffd0 > >> > X11: 000000761c517a40 X10: 0000000000000001 X9: > 0000000000000000 > >> > X8: 0000000000000040 X7: 7f7f7f7f7f7f7f7f X6: > 0000000000000010 > >> > X5: 000000761c517c0c X4: ffffffffffffffff X3: > ffffffffffffffff > >> > X2: 0000000000000001 X1: 000000761c517c11 X0: > 00000000000003e3 > >> > ORIG_X0: 00000000000003e3 SYSCALLNO: 40 PSTATE: 00001000 > >> > > >> ditto. > >> > >> Maybe a short summary after the output paste, telling us which info > >> should reviewers pay attention to. > >> > >> Thanks, > >> Tao Liu > >> > >> > Signed-off-by: Guanyou.Chen <chenguan...@xiaomi.com> > >> > --- > >> > netdump.c | 6 +++--- > >> > 1 file changed, 3 insertions(+), 3 deletions(-) > >> > > >> > diff --git a/netdump.c b/netdump.c > >> > index b4e2a5c..8ea5159 100644 > >> > --- a/netdump.c > >> > +++ b/netdump.c > >> > @@ -75,7 +75,7 @@ void > >> > map_cpus_to_prstatus(void) > >> > { > >> > void **nt_ptr; > >> > - int online, i, j, nrcpus; > >> > + int online, i, nrcpus; > >> > size_t size; > >> > > >> > if (pc->flags2 & QEMU_MEM_DUMP_ELF) /* notes exist for all cpus > */ > >> > @@ -100,9 +100,9 @@ map_cpus_to_prstatus(void) > >> > */ > >> > nrcpus = (kt->kernel_NR_CPUS ? kt->kernel_NR_CPUS : NR_CPUS); > >> > > >> > - for (i = 0, j = 0; i < nrcpus; i++) { > >> > + for (i = 0; i < nrcpus; i++) { > >> > if (in_cpu_map(ONLINE_MAP, i) && > machdep->is_cpu_prstatus_valid(i)) { > >> > - nd->nt_prstatus_percpu[i] = nt_ptr[j++]; > >> > + nd->nt_prstatus_percpu[i] = nt_ptr[i]; > > I re-think about the code change, it seems to me the variable 'j' > doesn't make any sense here. > > nt_ptr = (void **)GETBUF(size); > BCOPY(nd->nt_prstatus_percpu, nt_ptr, size); > BZERO(nd->nt_prstatus_percpu, size); > > The data of nt_ptr buffer is copied from nd->nt_prstatus_percpu > buffer, then nd->nt_prstatus_percpu buffer is cleared. If cpu 0 1 2 4 > is online and 3 is offline, nd->nt_prstatus_percpu[4] = nt_ptr[3]; > that is, copy the original cpu3's info into current cpu4's buf. This > cpu shift operation doesn't make any sense to me, > nd->nt_prstatus_percpu[4] = nt_ptr[4] should make sense in this case. > By searching the code, I didn't see nd->nt_prstatus_percpu[cpu] except > for any cpu shifts. > > However I cannot find out the git history of why variable 'j' was > introduced. So this patch looks correct to me, though I haven't run > the overall regression test against it. What's your opinion on the > variable 'j'? In addition, I can do a regression test if needed. > @Lianbo Jiang > > The variable 'j' was introduced into crash utility with the related ppc64 patch changes together, however this changed the corresponding relationship between the cpu and nt_prstatus_percpu[], in addition, the num_prstatus_notes was also changed(only include online cpus' notes).
For ppc64, it has a fadump case, which does not depend on the crash_notes. I guess we can not remove it simply, and need to consider it more: [1] why did the num_prstatus_notes need to be changed? [2] why did we copy the values of nt_prstatus_percpu[] to another array nt_ptr instead of changing the values on it directly? Can the variable 'nt_ptr' be dropped? [3] how to handle the code related to this? There is another similar code in diskdump.c, which seems need to be > updated as well. > It's true. I also saw the similar code in ppc. Thanks Lianbo > Thanks, > Tao Liu > > > >> > nd->num_prstatus_notes = > >> > MAX(nd->num_prstatus_notes, i+1); > >> > } > >> > -- > >> > 2.34.1 > >> > > >> > Guanyou. > >> > Thanks. > >> > >
-- Crash-utility mailing list -- devel@lists.crash-utility.osci.io To unsubscribe send an email to devel-le...@lists.crash-utility.osci.io https://${domain_name}/admin/lists/devel.lists.crash-utility.osci.io/ Contribution Guidelines: https://github.com/crash-utility/crash/wiki