Re: [Crash-utility] [External Mail]Re: [PATCH] Bugfix and optimization for ARM64 getting crash notes

2019-12-13 Thread
Hi Dave,

I test your attached patch, it achieves the proper expectation for your 
modifies .

WARNING: cpu 0: invalid NT_PRSTATUS note (name != "CORE")
WARNING: cpu 3: invalid NT_PRSTATUS note (name != "CORE")
WARNING: cpu 4: invalid NT_PRSTATUS note (name != "CORE")
please wait... (determining panic task)
WARNING: cannot determine starting stack frame for task ffdd0066cd80

WARNING: cannot determine starting stack frame for task ffdd0066ae80

WARNING: cannot determine starting stack frame for task ffdcb7e40f80

crash> bt -a
PID: 244TASK: ffdcb7e40f80  CPU: 0   COMMAND: "hang_detect"
bt: WARNING: cannot determine starting stack frame for task ffdcb7e40f80

PID: 0  TASK: ffdd0066dd00  CPU: 1   COMMAND: "swapper/1"
 #0 [ffdd3fe4ea40] mrdump_stop_noncore_cpu at ffa5f6932a0c
 #1 [ffdd3fe4ebc0] flush_smp_call_function_queue at ffa5f58b8684
 #2 [ffdd3fe4ec20] generic_smp_call_function_single_interrupt at 
ffa5f58ba8c4
 #3 [ffdd3fe4ec30] handle_IPI at ffa5f56a8a10
 #4 [ffdd3fe4eca0] gic_handle_irq at ffa5f5682294
---  ---
 #5 [ffdd006c7de0] el1_irq at ffa5f56844e4
 PC: ffa5f7120b28  [cpuidle_enter_state+336]
 LR: ffa5f7120d70  [cpuidle_enter_state+920]
 SP: ffdd006c7df0  PSTATE: 10c00145
X29: ffdd006c7df0  X28: 0001  X27: 0001
X26: ffa5fa04e000  X25: 0001  X24: 378c55cbccab
X23: ffa5f9970980  X22: 0001  X21: ffdcb6f6a400
X20: ffa5fa04efa8  X19: 378c56210393  X18: 
X17:   X16:   X15: 
X14: ffdd0066dd00  X13: 0037475a6000  X12: 3455591d
X11:   X10: 1000   X9: 
 X8: ff8ba00d8f6c   X7:    X6: 1ff4bf57ac45
 X5: 00209246a095bf14   X4: 348a8d795b93   X3: 431bde82d7b634db
 X2: 1ffba00cdba3   X1:    X0: 
 #6 [ffdd006c7df0] cpuidle_enter_state at ffa5f7120b24
 #7 [ffdd006c7e70] cpuidle_enter at ffa5f712150c
 #8 [ffdd006c7e80] call_cpuidle at ffa5f5805e24
 #9 [ffdd006c7eb0] do_idle at ffa5f5806320
#10 [ffdd006c7f80] cpu_startup_entry at ffa5f5806910
#11 [ffdd006c7fa0] secondary_start_kernel at ffa5f56a7fac

PID: 0  TASK: ffdd00669f00  CPU: 2   COMMAND: "swapper/2"
 #0 [ffdd3fe6ca40] mrdump_stop_noncore_cpu at ffa5f6932a0c
 #1 [ffdd3fe6cbc0] flush_smp_call_function_queue at ffa5f58b8684
 #2 [ffdd3fe6cc20] generic_smp_call_function_single_interrupt at 
ffa5f58ba8c4
 #3 [ffdd3fe6cc30] handle_IPI at ffa5f56a8a10
 #4 [ffdd3fe6cca0] gic_handle_irq at ffa5f5682294
---  ---
 #5 [ffdd006cfde0] el1_irq at ffa5f56844e4
 PC: ffa5f7120b28  [cpuidle_enter_state+336]
 LR: ffa5f7120d70  [cpuidle_enter_state+920]
 SP: ffdd006cfdf0  PSTATE: 10c00145
X29: ffdd006cfdf0  X28: 0001  X27: 0001
X26: ffa5fa04e000  X25: 0001  X24: 378c5463ce04
X23: ffa5f9970980  X22: 0001  X21: ffdcb6f69b00
X20: ffa5fa04efa8  X19: 378c5621219f  X18: 
X17:   X16:   X15: 
X14: ffdd00669f00  X13: 0037475c4000  X12: 3455591d
X11:   X10: 1000   X9: 
 X8: ff8ba00d9f6c   X7:    X6: 1ff4bf57ac5d
 X5: 00209246a095bf14   X4: 348a8d795b93   X3: 431bde82d7b634db
 X2: 1ffba00cd3e3   X1:    X0: 
 #6 [ffdd006cfdf0] cpuidle_enter_state at ffa5f7120b24
 #7 [ffdd006cfe70] cpuidle_enter at ffa5f712150c
 #8 [ffdd006cfe80] call_cpuidle at ffa5f5805e24
 #9 [ffdd006cfeb0] do_idle at ffa5f5806320
#10 [ffdd006cff80] cpu_startup_entry at ffa5f580690c
#11 [ffdd006cffa0] secondary_start_kernel at ffa5f56a7fac

PID: 0  TASK: ffdd0066cd80  CPU: 3   COMMAND: "swapper/3"
bt: WARNING: cannot determine starting stack frame for task ffdd0066cd80

PID: 0  TASK: ffdd0066ae80  CPU: 4   COMMAND: "swapper/4"
bt: WARNING: cannot determine starting stack frame for task ffdd0066ae80

..

Best regards,
Qiwu

-Original Message-
From: Dave Anderson 
Sent: Saturday, December 14, 2019 12:27 AM
To: 陈启武 
Cc: crash-utility@redhat.com
Subject: Re: [External Mail]Re: [PATCH] Bugfix and optimization for ARM64 
getting crash notes



- Original Message -
> Hi Dave,
> I have mistaken understand about the first point of view, thanks for correct 
> my fault.
> But how about my second point about the optimization for ARM64 getting crash 
> notes?
> 2) arm64_get_crash_notes() check the sanity of NT_P

Re: [Crash-utility] [External Mail]Re: [PATCH] Bugfix and optimization for ARM64 getting crash notes

2019-12-12 Thread
a5f5806910
#11 [ffdd006c7fa0] secondary_start_kernel at ffa5f56a7fac

PID: 0  TASK: ffdd00669f00  CPU: 2   COMMAND: "swapper/2"
 #0 [ffdd3fe6ca40] mrdump_stop_noncore_cpu at ffa5f6932a0c
 #1 [ffdd3fe6cbc0] flush_smp_call_function_queue at ffa5f58b8684
 #2 [ffdd3fe6cc20] generic_smp_call_function_single_interrupt at 
ffa5f58ba8c4
 #3 [ffdd3fe6cc30] handle_IPI at ffa5f56a8a10
 #4 [ffdd3fe6cca0] gic_handle_irq at ffa5f5682294
---  ---
 #5 [ffdd006cfde0] el1_irq at ffa5f56844e4
 PC: ffa5f7120b28  [cpuidle_enter_state+336]
 LR: ffa5f7120d70  [cpuidle_enter_state+920]
 SP: ffdd006cfdf0  PSTATE: 10c00145
X29: ffdd006cfdf0  X28: 0001  X27: 0001
X26: ffa5fa04e000  X25: 0001  X24: 378c5463ce04
X23: ffa5f9970980  X22: 0001  X21: ffdcb6f69b00
X20: ffa5fa04efa8  X19: 378c5621219f  X18: 
X17:   X16:   X15: 
X14: ffdd00669f00  X13: 0037475c4000  X12: 3455591d
X11:   X10: 1000   X9: 
 X8: ff8ba00d9f6c   X7:    X6: 1ff4bf57ac5d
 X5: 00209246a095bf14   X4: 348a8d795b93   X3: 431bde82d7b634db
 X2: 1ffba00cd3e3   X1:    X0: 
 #6 [ffdd006cfdf0] cpuidle_enter_state at ffa5f7120b24
 #7 [ffdd006cfe70] cpuidle_enter at ffa5f712150c
 #8 [ffdd006cfe80] call_cpuidle at ffa5f5805e24
 #9 [ffdd006cfeb0] do_idle at ffa5f5806320
#10 [ffdd006cff80] cpu_startup_entry at ffa5f580690c
#11 [ffdd006cffa0] secondary_start_kernel at ffa5f56a7fac
...

Best regards,
Qiwu



-Original Message-
From: Dave Anderson 
Sent: Friday, December 13, 2019 4:38 AM
To: 陈启武 
Cc: crash-utility@redhat.com
Subject: Re: [External Mail]Re: [PATCH] Bugfix and optimization for ARM64 
getting crash notes



- Original Message -
> Hi Dave,
> Above your suggestion, I made changes for patch v2.
>
> Best regards,
> Qiwu

Sorry, but I'm going to NAK this patch in its current form.  I'm not exactly 
sure what you're trying to accomplish, but in my testing, it breaks things 
unnecessarily.

For example, moving the call to arm64_get_crash_notes() from POST_VM to 
POST_INIT breaks this logic in task_init():

645 else {
646 if (KDUMP_DUMPFILE())
647 map_cpus_to_prstatus();
648 else if (ELF_NOTES_VALID() && DISKDUMP_DUMPFILE())
649 map_cpus_to_prstatus_kdump_cmprs();
650 please_wait("determining panic task");
651 set_context(get_panic_context(), NO_PID);
652 please_wait_done();
653 }

The get_panic_context() call requires that the ARM64 arm64_get_crash_notes() 
has already been called, and that ms->panic_task_regs[] array has been 
allocated and populated.  But with your patch applied, it has not been called 
yet, and therefore ms->page_task_regs is still NULL when get_panic_context() is 
called above.

As a result, for example, on a compressed kdump clone created by virsh dump, I 
now see this during initialization:

  please wait... (determining panic task)
  WARNING: cannot determine starting stack frame for task 08c25280

  WARNING: cannot determine starting stack frame for task 8000fa01b000

  WARNING: cannot determine starting stack frame for task 8000fa01c000

  WARNING: cannot determine starting stack frame for task 8000fa01d000

But all of those active tasks have NT_PRSTATUS notes and backtraces:

  crash> help -D | grep PC:
 LR: 08085938   SP: 08be3ee0   PC: 
08099cf8
 LR: 08085938   SP: 8000fa06ff30   PC: 
08099cf8
 LR: 08085938   SP: 8000fa073f30   PC: 
08099cf8
 LR: 08085938   SP: 8000fa077f30   PC: 
08099cf8
  crash>

And therefore have legitimate starting stack frames:

  crash> bt 08c25280 8000fa01b000 8000fa01c000 8000fa01d000
  PID: 0  TASK: 08c25280  CPU: 0   COMMAND: "swapper/0"
   #0 [08be3ee0] cpu_do_idle at 08099cf4
   #1 [08be3f10] default_idle_call at 08766f90
   #2 [08be3f20] cpu_startup_entry at 08110fe4
   #3 [08be3f80] rest_init at 08761674
   #4 [08be3f90] start_kernel at 08ab0be8

  PID: 0  TASK: 8000fa01b000  CPU: 1   COMMAND: "swapper/1"
   #0 [8000fa06ff30] cpu_do_idle at 08099cf4
   #1 [8000fa06ff60] default_idle_call at 08766f90
   #2 [8000fa06ff70] cpu_startup_entry at 08110fe4
   #3 [8000f

Re: [Crash-utility] [External Mail]Re: [PATCH] Bugfix and optimization for ARM64 getting crash notes

2019-12-11 Thread
Hi Dave,
Above your suggestion, I made changes for patch v2.

Best regards,
Qiwu

-Original Message-
From: Dave Anderson 
Sent: Wednesday, December 11, 2019 12:51 AM
To: qiwuche...@gmail.com
Cc: crash-utility@redhat.com; 陈启武 
Subject: [External Mail]Re: [PATCH] Bugfix and optimization for ARM64 getting 
crash notes



- Original Message -
> From: chenqiwu 
>
> 1) ARM64 call arm64_get_crash_notes() to retrieve active task
> registers when POST_VM before calling map_cpus_to_prstatus() to remap
> the NT_PRSTATUS elf notes to the online cpus. It's better to call
> arm64_get_crash_notes() when POST_INIT.
> 2) arm64_get_crash_notes() check the sanity of NT_PRSTATUS notes only
> for online cpus. If one cpu contains invalid note, it's better to
> continue finding the crash notes for other online cpus.
> So we can extract the backtraces for the online cpus which contain
> valid note by using command "bt -a".
> 3) map_cpus_to_prstatus() remap the NT_PRSTATUS notes only to the
> online cpus. Make sure there must be a one-to-one relationship between
> the number of online cpus and the number of notes.

The code in map_cpus_to_prstatus() and map_cpus_to_prstatus_kdump_cmprs()
has been in place forever.  Both the nd->nt_prstatus_percpu[] and
dd->nt_prstatus_percpu[] arrays are per-cpu regardless whether
they are online or offline.  However, since kdump only creates NT_PRSTATUS 
notes for on-line cpus, the "i" index is needed for each cpu, and the "j" index 
is needed for the existing NT_PRSTATUS notes.  If a cpu is offline, its 
nt_prstatus_percpu[] entry will be zeroed out.

I'm not arguing that the arm64 online-cpu handling may be suspect, but your 
patch should not be making changes to architectural-neutral code unless the 
issue affects all architectures.  So please leave those two functions alone.

Dave


>
> Signed-off-by: chenqiwu 
> ---
>  arm64.c| 49 +
>  diskdump.c |  9 +++--
>  netdump.c  |  4 ++--
>  3 files changed, 34 insertions(+), 28 deletions(-)
>
> diff --git a/arm64.c b/arm64.c
> index 233029d..cbad461 100644
> --- a/arm64.c
> +++ b/arm64.c
> @@ -458,7 +458,7 @@ arm64_init(int when)
>  arm64_stackframe_init()
>  break;
>
> -case POST_VM:
> +case POST_INIT:
>  /*
>   * crash_notes contains machine specific information about the
>   * crash. In particular, it contains CPU registers at the time @@
> -3587,7 +3587,7 @@ arm64_get_crash_notes(void)
>  ulong offset;
>  char *buf, *p;
>  ulong *notes_ptrs;
> -ulong i;
> +ulong i, j;
>
>  if (!symbol_exists("crash_notes"))
>  return FALSE;
> @@ -3620,12 +3620,12 @@ arm64_get_crash_notes(void)
>  if (!(ms->panic_task_regs = calloc((size_t)kt->cpus, sizeof(struct
>  arm64_pt_regs
>  error(FATAL, "cannot calloc panic_task_regs space\n");
>
> -for  (i = 0; i < kt->cpus; i++) {
> -
> +for  (i = 0, j = 0; i < kt->cpus; i++) {
>  if (!readmem(notes_ptrs[i], KVADDR, buf, SIZE(note_buf),
>  "note_buf_t", RETURN_ON_ERROR)) {
> -error(WARNING, "failed to read note_buf_t\n");
> -goto fail;
> +error(WARNING, "cpu#%d: failed to read note_buf_t\n", i);
> +++j;
> +continue;
>  }
>
>  /*
> @@ -3655,19 +3655,29 @@ arm64_get_crash_notes(void)
>  note->n_descsz == notesz)
>  BCOPY((char *)note, buf, notesz);
>  } else {
> -error(WARNING,
> -"cannot find NT_PRSTATUS note for cpu: %d\n", i);
> +if (CRASHDEBUG(1))
> +error(WARNING,
> +"cpu#%d: cannot find NT_PRSTATUS note\n", i);
> +++j;
>  continue;
>  }
>  }
>
> +/*
> + * Check the sanity of NT_PRSTATUS note only for each online cpu.
> + * If this cpu has invalid note, continue to find the crash notes
> + * for other online cpus.
> + */
>  if (note->n_type != NT_PRSTATUS) {
> -error(WARNING, "invalid note (n_type != NT_PRSTATUS)\n");
> -goto fail;
> +error(WARNING, "cpu#%d: invalid note (n_type != NT_PRSTATUS)\n", i);
> +++j;
> +continue;
>  }
> -if (p[0] != 'C' || p[1] != 'O' || p[2] != 'R' || p[3] != 'E') {
> -error(WARNING, "invalid note (name != \"CORE\"\n");
> -goto fail;
> +
> +if (!STRNEQ(p, "CORE")) {
> +error(WARNING, "cpu#%d: invalid note (name != \"CORE\")\n", i);
> +++j;
> +continue;
>  }
>
>  /*
> @@ -3684,14 +3694,13 @@ arm64_get_crash_notes(void)
>
>  FREEBUF(buf);
>  FREEBUF(notes_ptrs);
> -return TRUE;
>
> -fail:
> -FREEBUF(buf);
> -FREEBUF(notes_ptrs);
> -free(ms->panic_task_regs);
> -ms->panic_task_regs = NULL;
> -return FALSE;
> +if (j == kt->cpus) {
> +free(ms->panic_task_regs);
>

Re: [Crash-utility] [External Mail]Re: [PATCH] Fix a potential segfault for the ARM64 "bt -S " command

2019-11-05 Thread
Hi Dave,
1. Is this a kdump-generated dumpfile?
It's a kdump-generated dumpfile for arm64.

2. Have you looked into why you get the "bt: WARNING: cannot determine starting 
stack frame for task ffcd74122000" message?
Because kernel didn't enable crash_notes symbol to save active task regs.

3. You didn't show the results of your patch -- if you apply it, does the 
backtrace get displayed correctly?
>From the result of my patch, it shows bade stack frame for sp address 
>0xff800c42ba00.
crash> bt -S ff800c42ba00 108
PID: 108TASK: ffcd74122000  CPU: 5   COMMAND: "rtmm_reclaim"
bt: WARNING: cannot determine starting stack frame for task ffcd74122000
 #0 [ff9c29e4fa10] (null) at fffc

4. Since the "bt -S" option is almost never used.  Would it be possible to 
restrict your patch to fix/verify things in the section where it handles the 
bt->hp->sp setting?
I add the following change to print where it handles the bt->hp->sp setting:
--- a/arm64.c
+++ b/arm64.c
@@ -2542,7 +2542,8 @@ arm64_back_trace_cmd(struct bt_info *bt)
 *   x+8: contains stackframe.pc -- text return address
 *  x+16: is the stackframe.sp address
 */
-
+   fprintf(stderr, "bt:flags=%llx, bptr=%lx, eip=%lx, esp=%lx, stkptr=%lx, 
instptr=%lx, frameptr=%lx\n",
+   bt->flags, bt->bptr, bt->hp->eip, bt->hp->esp, bt->stkptr, 
bt->instptr, bt->frameptr);
if (bt->flags & BT_KDUMP_ADJUST) {
if (arm64_on_irq_stack(bt->tc->processor, bt->bptr)) {
arm64_set_irq_stack(bt);
@@ -2572,6 +2573,7 @@ arm64_back_trace_cmd(struct bt_info *bt)
stackframe.fp = bt->frameptr;
}

+   fprintf(stderr, "stackframe:sp=%lx, pc=%lx, fp=%lx\n", stackframe.sp, 
stackframe.pc, stackframe.fp);
if (bt->flags & BT_TEXT_SYMBOLS) {
arm64_print_text_symbols(bt, , ofp);
 if (BT_REFERENCE_FOUND(bt)) {

The result shows as below:
crash> bt -S ff800c42ba00 108
bt:flags=4, bptr=0, eip=0, esp=ff800c42ba00, 
stkptr=ff800c42ba00, instptr=0, frameptr=0
stackframe:sp=ff800c42ba08, pc=0, fp=ff9c29e4fa10

It seems invalid stackframe.sp and pc calculated by 
GET_STACK_ULONG(bt->hp->esp). I think it must be resulted from invalid 
bt->stackbuf address.
(gdb) p /x *(struct bt_info *) 0x7fffd640
$4 = {task = 0xffcd74122000, flags = 0x0, instptr = 0x0, stkptr = 
0xff800c42ba00, bptr = 0x0, stackbase = 0xff800c428000,
  stacktop = 0xff800c42c000, stackbuf = 0x55f23ae0, tc = 
0x596e1778, hp = 0x7fffd5f0, textlist = 0x0, ref = 0x0, frameptr = 0x0,
  call_target = 0x0, machdep = 0x0, debug = 0x0, eframe_ip = 0x0, radix = 0x0, 
cpumask = 0x0}

so this is the reason for that matter what is the stackframe.pc and 
stackframe.fp.

Best regards,
Qiwu



-Original Message-
From: Dave Anderson 
Sent: Monday, November 4, 2019 11:39 PM
To: Discussion list for crash utility usage, maintenance and development 

Cc: 陈启武 
Subject: [External Mail]Re: [Crash-utility] [PATCH] Fix a potential segfault 
for the ARM64 "bt -S " command



- Original Message -

> > The stackframe.fp(0xff9c29e4f8e0) is larger than the stacktop
> > address, so lead to segmentation violation gernarated by accessing regs->sp:
> > (gdb) p /x 18446743644915693792//stkptr
> > $5 = 0xff9c29e4f8e0
> > (gdb) p /x
> > 0xff9c29e4f8e0-0xff800c428000//STACK_OFFSET_TYPE(stkptr)
> > $6 = 0x1c1da278e0
> > (gdb) p /x regs
> > $7 = 0x55717394b3c0
> > (gdb) p *(struct arm64_pt_regs *) 0x55717394b3c0 Cannot access
> > memory at address 0x55717394b3c0
> >
> > For fix this, I think it must be add a condition 
> > "arm64_in_exception_text(stackframe.pc) && INSTACK(stackframe.fp, bt)"
> > to avoid an invalid exception frame before transitioning to the process 
> > stack.

Or alternatively, would it be better to have arm64_is_kernel_exception_frame() 
verify that the "regs" assignment is legitimate, and if not, just return FALSE?

Dave

#/**本邮件及其附件含有小米公司的保密信息,仅限于发送给上面地址中列出的个人或群组。禁止任何其他人以任何形式使用(包括但不限于全部或部分地泄露、复制、或散发)本邮件中的信息。如果您错收了本邮件,请您立即电话或邮件通知发件人并删除本邮件!
 This e-mail and its attachments contain confidential information from XIAOMI, 
which is intended only for the person or entity whose address is listed above. 
Any use of the information contained herein in any way (including, but not 
limited to, total or partial disclosure, reproduction, or dissemination) by 
persons other than the intended recipient(s) is prohibited. If you receive this 
e-mail in error, please notify the sender by phone or email immediately and 
delete it!**/#

--
Crash-utility mailing list
Crash-utility@redhat.com
https://www.redhat.com/mailman/listinfo/crash-utility

[Crash-utility] [PATCH] Fix a potential segfault for the ARM64 "bt -S " command

2019-11-04 Thread
Hi Dave,
I‘m working on arm64 kdump by crash-7.2.7++.
There is a potential segmentation violation due to an invalid exception frame 
before
transitioning to the process stack when try using the bt command's "-S 
" options.
For example, take the sp argument from the log:
[   84.048650] pc : _raw_spin_lock+0x30/0x88
[   84.048658] lr : lowmem_scan+0x45c/0xbd0
[   84.048661] sp : ff800c42ba00 pstate : 20c00145

A segmentation violation is generated as below:
crash> bt -S ff800c42ba00 108
PID: 108TASK: ffcd74122000  CPU: 5   COMMAND: "rtmm_reclaim"
bt: WARNING: cannot determine starting stack frame for task ffcd74122000

Program received signal SIGSEGV, Segmentation fault.
0x5572de2b in arm64_is_kernel_exception_frame (bt=0x7fffd640, 
stkptr=18446743644915693792) at arm64.c:1785
warning: Source file is more recent than executable.
1785if (INSTACK(regs->sp, bt) && INSTACK(regs->regs[29], bt) &&
(gdb) bt
#0  0x5572de2b in arm64_is_kernel_exception_frame (bt=0x7fffd640, 
stkptr=18446743644915693792) at arm64.c:1785
#1  0x5572ffaf in arm64_back_trace_cmd (bt=0x7fffd640) at 
arm64.c:2594
#2  0x556ef0c4 in back_trace (bt=0x7fffd640) at kernel.c:3164
#3  0x556ed624 in cmd_bt () at kernel.c:2833
#4  0x5564a73b in exec_command () at main.c:879
#5  0x5564a515 in main_loop () at main.c:826
#6  0x558b5b43 in captured_command_loop (data=data@entry=0x0) at 
main.c:258
#7  0x558b46ca in catch_errors (func=func@entry=0x558b5b30 
, func_args=func_args@entry=0x0,
errstring=errstring@entry=0x55b0b728 "", mask=mask@entry=6) at 
exceptions.c:557
#8  0x558b6c42 in captured_main (data=data@entry=0x7fffdfb0) at 
main.c:1064
#9  0x558b46ca in catch_errors (func=func@entry=0x558b5e70 
, func_args=func_args@entry=0x7fffdfb0,
errstring=errstring@entry=0x55b0b728 "", mask=mask@entry=6) at 
exceptions.c:557
#10 0x558b702e in gdb_main (args=0x7fffdfb0) at main.c:1079
#11 gdb_main_entry (argc=, argv=) at main.c:1099
#12 0x5570fc53 in gdb_main_loop (argc=2, argv=0x7fffe148) at 
gdb_interface.c:76
#13 0x5564a1e0 in main (argc=4, argv=0x7fffe148) at main.c:707

(gdb) p /x *(struct bt_info *) 0x7fffd640
$4 = {task = 0xffcd74122000, flags = 0x0, instptr = 0x0, stkptr = 
0xff800c42ba00, bptr = 0x0, stackbase = 0xff800c428000,
  stacktop = 0xff800c42c000, stackbuf = 0x55f23ae0, tc = 
0x596e1778, hp = 0x7fffd5f0, textlist = 0x0, ref = 0x0, frameptr = 0x0,
  call_target = 0x0, machdep = 0x0, debug = 0x0, eframe_ip = 0x0, radix = 0x0, 
cpumask = 0x0}

The stackframe.fp(0xff9c29e4f8e0) is larger than the stacktop address, so 
lead to segmentation violation gernarated by accessing regs->sp:
(gdb) p /x 18446743644915693792//stkptr
$5 = 0xff9c29e4f8e0
(gdb) p /x 0xff9c29e4f8e0-0xff800c428000//STACK_OFFSET_TYPE(stkptr)
$6 = 0x1c1da278e0
(gdb) p /x regs
$7 = 0x55717394b3c0
(gdb) p *(struct arm64_pt_regs *) 0x55717394b3c0
Cannot access memory at address 0x55717394b3c0

For fix this, I think it must be add a condition 
"arm64_in_exception_text(stackframe.pc) && INSTACK(stackframe.fp, bt)" to avoid
an invalid exception frame before transitioning to the process stack.

The patch file has been upload to attachment.
Thanks for your review. I’m looking forward to your favourable reply!

Best regards,
Qiwu

#/**本邮件及其附件含有小米公司的保密信息,仅限于发送给上面地址中列出的个人或群组。禁止任何其他人以任何形式使用(包括但不限于全部或部分地泄露、复制、或散发)本邮件中的信息。如果您错收了本邮件,请您立即电话或邮件通知发件人并删除本邮件!
 This e-mail and its attachments contain confidential information from XIAOMI, 
which is intended only for the person or entity whose address is listed above. 
Any use of the information contained herein in any way (including, but not 
limited to, total or partial disclosure, reproduction, or dissemination) by 
persons other than the intended recipient(s) is prohibited. If you receive this 
e-mail in error, please notify the sender by phone or email immediately and 
delete it!**/#


0001-Fix-a-potential-segfault-for-the-ARM64-bt-S-stack-ad.patch
Description: 0001-Fix-a-potential-segfault-for-the-ARM64-bt-S-stack-ad.patch
--
Crash-utility mailing list
Crash-utility@redhat.com
https://www.redhat.com/mailman/listinfo/crash-utility

Re: [Crash-utility] [External Mail]Re: [PATCH] optimize the way to find the panic task.

2019-10-22 Thread
Hi Dave,
Thanks for your review. It's a great honor to give me some valuable suggestions 
about my patch.
I have different points of view about the definition of max logbuf length.
In upstream kernel, the max logbuf length is still determined by arch-specific 
CONFIG_LOG_BUF_SHIFT definition.
[kernel/printk/printk.c]
#define __LOG_BUF_LEN (1 << CONFIG_LOG_BUF_SHIFT)
static char __log_buf[__LOG_BUF_LEN] __aligned(LOG_ALIGN);

LOG_BUF_LEN_MAX comes from commit e6fe3e5b7d16e8f146a4ae7fe481bc6e97acde1e, 
which give error on
attempt to set log buffer length to over 2G.

For the question where I got the MAX_BUFSIZE of 2MB?
I'm working on QCOM's ARM64 arch. In QCOM's kernel-4.14 code, the max logbuf 
length is set to 2MB.
[arch/arm64/configs/xxx_config]
CONFIG_LOG_BUF_SHIFT=21

Above your suggestions, I have correct the logic error and made some 
significant changes for my patch.
The new patch file has been upload to attachment.
Thanks for your review. I’m looking forward to your favourable reply!

Best regards,
Qiwu


-Original Message-
From: Dave Anderson 
Sent: Thursday, October 17, 2019 4:33 AM
To: 陈启武 
Subject: [External Mail]Re: [PATCH] optimize the way to find the panic task.


Hi Qiwu,

I tested your patch against several ARM64 dumpfiles that I have on hand, and a 
couple of them that were created with "virsh dump" generated a segmentation 
violations like this:

...
please wait... (determining panic task)
Program received signal SIGSEGV, Segmentation fault.
0x76bba0b0 in __strstr_sse42 () from /lib64/libc.so.6 Missing separate 
debuginfos, use: debuginfo-install glibc-2.17-260.el7_6.6.x86_64 
libgcc-4.8.5-36.el7_6.2.x86_64 libstdc++-4.8.5-36.el7_6.2.x86_64 
lzo-2.06-8.el7.x86_64 ncurses-libs-5.9-14.20130511.el7_4.x86_64 
snappy-1.1.0-3.el7.x86_64 zlib-1.2.7-18.el7.x86_64
(gdb) bt
#0  0x76bba0b0 in __strstr_sse42 () from /lib64/libc.so.6
#1  0x004b6389 in get_log_panic_task () at task.c:7485
#2  0x004ccefe in panic_search () at task.c:7361
#3  get_panic_context () at task.c:6205
#4  task_init () at task.c:642
#5  0x00460be5 in main_loop () at main.c:774
#6  0x00659383 in captured_command_loop (data=data@entry=0x0) at 
main.c:258
#7  0x006580aa in catch_errors (func=func@entry=0x659370 
, func_args=func_args@entry=0x0,
errstring=errstring@entry=0x890f87 "", mask=mask@entry=6) at 
exceptions.c:557
#8  0x0065a316 in captured_main (data=data@entry=0x7fffdb30) at 
main.c:1064
#9  0x006580aa in catch_errors (func=func@entry=0x659650 
, func_args=func_args@entry=0x7fffdb30,
errstring=errstring@entry=0x890f87 "", mask=mask@entry=6) at 
exceptions.c:557
#10 0x0065a677 in gdb_main (args=0x7fffdb30) at main.c:1079
#11 gdb_main_entry (argc=, argv=argv@entry=0x7fffdc98) at 
main.c:1099
#12 0x004eeab4 in gdb_main_loop (argc=, argc@entry=3, 
argv=argv@entry=0x7fffdc98) at gdb_interface.c:76
#13 0x0045f03a in main (argc=3, argv=0x7fffdc98) at main.c:707
(gdb)

The SIGSEGV is the strstr() call in get_log_panic_task(), where the "buf"
pointer must be OK, but for some reason "i" is not available in the gdb 
session.  So I added the following debug line:

   7479 BZERO(buf, MAX_BUFSIZE);
   7480 open_tmpfile();
   7481 dump_log(SHOW_LOG_TEXT);
   7482 rewind(pc->tmpfile);
   7483 if (fread(buf, 1, MAX_BUFSIZE, pc->tmpfile)) {
   7484 while (panic_keywords[i++]) {
   7485 fprintf(stderr, "[%d][%s]\n", i, panic_keywords[i]);
   7486 if ((p1 = strstr(buf, panic_keywords[i]))) {
   7487 if ((p1 = strstr(p1, "CPU: "))) {
   7488 p1 += strlen("CPU: ");
   7489 p2 = p1;
   7490


and as expected, it runs off the end of the panic_keywords[] array:

  ...

  please wait... (determining panic task)[1][BUG: unable to handle kernel]
  [2][Kernel BUG at]
  [3][kernel BUG at]
  [4][Bad mode in]
  [5][Oops]
  [6][Kernel panic]
  [7][(null)]
  Segmentation fault (core dumped)
  $

But anyway, aside from the logic error above, a couple other comments:

(1) I do not want to change the order in which the panic task
search is made -- it should still try "foreach bt" first,
and only if that fails, search the log.

(2) The upstream kernel has a LOG_BUF_LEN_MAX that is 2GB,
so I'm not sure where you got the MAX_BUFSIZE of 2MB?

(3) But regardless of the log buffer size, I don't like the idea
of reading the whole log into a buffer.  It's already captured
into a temporary file that can be searched, so why bother copying
it into another buffer?

I would suggest using "while (fgets(buf, BUFSIZE, pc->tmpfile))"
instead.  BUFSIZE should be large enough to contain any line in the log buffe

Re: [Crash-utility] [External Mail]Re: [PATCH] Fix for the determination of the ARM64 SECTION_SIZE_BITS

2019-09-18 Thread
Hi Dave,
I resend the email with the patch attached. It’s a bugfix for the determination 
of the ARM64 SECTION_SIZE_BITS.
For some ARM64 arch platforms such as qcom 8150/8250 boards, they define their 
own section size bits in kernel config to support
memory hotplug, which is not same as ARM64 default SECTION_SIZE_BITS value: 30. 
The SECTION_SIZE_BITS definition in qcom's linux-4.14 code as below:
[arch/arm64/include/asm/sparsemem.h]
#ifdef CONFIG_SPARSEMEM
#define MAX_PHYSMEM_BITS48
#ifndef CONFIG_MEMORY_HOTPLUG
#define SECTION_SIZE_BITS   30
#else
#define SECTION_SIZE_BITS   CONFIG_HOTPLUG_SIZE_BITS
#endif
#endif

Without the patch, after load ramdump by crash tool, the warning message maybe 
displayed:
crash> kmem -P 0x8000
kmem: WARNING: cannot find mem_map page for address: fffe8000
kmem: cannot determine page for 8000
8000: physical address not found in mem map

So I introduce this patch to fix it, we can determine SECTION_SIZE_BITS either 
by reading VMCOREINFO or the kernel config,
otherwise borrow the 64-bit ARM default definiton.
diff --git a/arm64.c b/arm64.c
index 6b34b5f..31c03d4 100644
--- a/arm64.c
+++ b/arm64.c
@@ -32,6 +32,7 @@ static int verify_kimage_voffset(void);
static void arm64_calc_kimage_voffset(void);
static void arm64_calc_phys_offset(void);
static void arm64_calc_virtual_memory_ranges(void);
+static void arm64_get_section_size_bits(void);
static int arm64_kdump_phys_base(ulong *);
static ulong arm64_processor_speed(void);
static void arm64_init_kernel_pgd(void);
@@ -375,7 +376,11 @@ arm64_init(int when)

case POST_GDB:
arm64_calc_virtual_memory_ranges();
-   machdep->section_size_bits = _SECTION_SIZE_BITS;
+   arm64_get_section_size_bits();
+   if (CRASHDEBUG(1)) {
+   fprintf(fp, "SECTION_SIZE_BITS: %ld\n", 
machdep->section_size_bits);
+   }
+
if (!machdep->max_physmem_bits) {
if ((string = 
pc->read_vmcoreinfo("NUMBER(MAX_PHYSMEM_BITS)"))) {
machdep->max_physmem_bits = atol(string);
@@ -1055,6 +1060,32 @@ arm64_calc_phys_offset(void)
fprintf(fp, "using %lx as phys_offset\n", ms->phys_offset);
}

+/*
+ *  Determine SECTION_SIZE_BITS either by reading VMCOREINFO or the kernel
+ *  config, otherwise borrow the 64-bit ARM default definiton.
+ */
+static void
+arm64_get_section_size_bits(void)
+{
+   int ret;
+   char *string;
+
+   if ((string = pc->read_vmcoreinfo("NUMBER(SECTION_SIZE_BITS)"))) {
+   machdep->section_size_bits = atol(string);
+   free(string);
+   return;
+   }
+
+   if ((ret = get_kernel_config("CONFIG_MEMORY_HOTPLUG", NULL)) == 
IKCONFIG_Y) {
+   if ((ret = get_kernel_config("CONFIG_HOTPLUG_SIZE_BITS", 
)) == IKCONFIG_STR) {
+   machdep->section_size_bits = atol(string);
+   free(string);
+   return;
+   }
+   }
+
+   machdep->section_size_bits = _SECTION_SIZE_BITS;
+}

Thanks for your review. I’m looking forward to your favourable reply!

Best regards,
Qiwu



-----Original Message-
From: Dave Anderson 
Sent: Wednesday, September 18, 2019 9:32 PM
To: 陈启武 
Cc: crash-utility@redhat.com
Subject: Re: [External Mail]Re: [PATCH] Fix a segfault in setup_ikconfig.



- Original Message -
> Hi Dave,
> 1. I'm requesting the arm64.c change as well. Because some chip makers
> would like to define their own section size bits in kernel config to
> support memory hotplug, which is not same as ARM64 default SECTION_SIZE_BITS 
> value:
> 30, such as QCOM SDM8150 or 8250 borads. Although linux kernel
> maintain line still define default SECTION_SIZE_BITS value: 30 as
> section_size_bits for sparsemem. The patch has detailed descript in
> the email "[PATCH] Fix for the determination of the ARM64 SECTION_SIZE_BITS".

I never received an email with that header.  Can you please resend the email 
with the patch attached (i.e., not inlined)?

Thanks,
  Dave

> 2. The output of "sys config" for my dumpfile has been uploaded, which
> shows the last line is empty. it occurs a segment fault when add the
> last empty line into ikconfig entry. I agree that the use of STRNEQ()
> is better than strncmp().
>
> Best regards,
> Qiwu
>
>
>
> -Original Message-
> From: Dave Anderson 
> Sent: Wednesday, September 18, 2019 3:22 AM
> To: 陈启武 
> Cc: crash-utility@redhat.com
> Subject: [External Mail]Re: [PATCH] Fix a segfault in setup_ikconfig.
>
>
>
> - Original Message -
> >
> > Hi Anderson,
> > I want to introduce a patch to your crash tool project. It’s a
> > bugfix for a segfault 

[Crash-utility] [PATCH] Fix a segfault in setup_ikconfig.

2019-09-07 Thread

Hi Anderson,
I want to introduce a patch to your crash tool project. It’s a bugfix for a 
segfault in setup_ikconfig.
We add an ikconfig entry only if ent[0] != '#', it is not an advisable 
condition because there is a potential segfault risk if ent is gibberish.
I explain the reproducing steps about this segfault case:
I try to apply the following patch to crash 7.2.6++ code for a test.
--- a/arm64.c
+++ b/arm64.c
@@ -32,6 +32,7 @@ static int verify_kimage_voffset(void);
static void arm64_calc_kimage_voffset(void);
static void arm64_calc_phys_offset(void);
static void arm64_calc_virtual_memory_ranges(void);
+static void arm64_get_section_size_bits(void);
static int arm64_kdump_phys_base(ulong *);
static ulong arm64_processor_speed(void);
static void arm64_init_kernel_pgd(void);
@@ -375,7 +376,11 @@ arm64_init(int when)

case POST_GDB:
arm64_calc_virtual_memory_ranges();
-   machdep->section_size_bits = _SECTION_SIZE_BITS;
+   arm64_get_section_size_bits();
+   if (CRASHDEBUG(1)) {
+   fprintf(fp, "SECTION_SIZE_BITS: %ld\n", 
machdep->section_size_bits);
+   }
+
if (!machdep->max_physmem_bits) {
if ((string = 
pc->read_vmcoreinfo("NUMBER(MAX_PHYSMEM_BITS)"))) {
machdep->max_physmem_bits = atol(string);
@@ -1055,6 +1060,32 @@ arm64_calc_phys_offset(void)
fprintf(fp, "using %lx as phys_offset\n", ms->phys_offset);
}

+/*
+ *  Determine SECTION_SIZE_BITS either by reading VMCOREINFO or the kernel
+ *  config, otherwise borrow the 64-bit ARM default definiton.
+ */
+static void
+arm64_get_section_size_bits(void)
+{
+   int ret;
+   char *string;
+
+   if ((string = pc->read_vmcoreinfo("NUMBER(SECTION_SIZE_BITS)"))) {
+   machdep->section_size_bits = atol(string);
+   free(string);
+   return;
+   }
+
+   if ((ret = get_kernel_config("CONFIG_MEMORY_HOTPLUG", NULL)) == 
IKCONFIG_Y) {
+   if ((ret = get_kernel_config("CONFIG_HOTPLUG_SIZE_BITS", 
)) == IKCONFIG_STR) {
+   machdep->section_size_bits = atol(string);
+   free(string);
+   return;
+   }
+   } else {
+   machdep->section_size_bits = _SECTION_SIZE_BITS;
+   }
+}

Then I make and load the dumpfiles by crash, it occurs a segment fault as below:
crash[31000]: segfault at 0 ip 7f0fb24d98d1 sp 7fff1703f7e8 error 4 in 
libc-2.26.so[7f0fb235b000+1d6000]

So I add debug to find out the segfault reason, It occurred in setup_ikconfig-> 
add_ikconfig_entry.
add_ikconfig_entry: ▒▒▒U//The last ent is a gibberish, lead to 
segfault

I think the most advisable judgement is if an ikconfig entry start with 
"CONFIG_". I debug by the following patch and never reproduce segfault again.
diff --git a/kernel.c b/kernel.c
index 7804aef..d023c87 100644
--- a/kernel.c
+++ b/kernel.c
@@ -10144,7 +10144,7 @@ static int setup_ikconfig(char *config)
while (whitespace(*ent))
ent++;

-   if (ent[0] != '#') {
+   if (!strncmp(ent, "CONFIG_", strlen("CONFIG_"))) {
add_ikconfig_entry(ent,
 _all[kt->ikconfig_ents++]);
if (kt->ikconfig_ents == IKCONFIG_MAX) {

Thanks for your review. I’m looking forward to your favourable reply!

Best regards,
Qiwu




#/**本邮件及其附件含有小米公司的保密信息,仅限于发送给上面地址中列出的个人或群组。禁止任何其他人以任何形式使用(包括但不限于全部或部分地泄露、复制、或散发)本邮件中的信息。如果您错收了本邮件,请您立即电话或邮件通知发件人并删除本邮件!
 This e-mail and its attachments contain confidential information from XIAOMI, 
which is intended only for the person or entity whose address is listed above. 
Any use of the information contained herein in any way (including, but not 
limited to, total or partial disclosure, reproduction, or dissemination) by 
persons other than the intended recipient(s) is prohibited. If you receive this 
e-mail in error, please notify the sender by phone or email immediately and 
delete it!**/#


0001-Fix-a-segfault-in-setup_ikconfig.patch
Description: 0001-Fix-a-segfault-in-setup_ikconfig.patch
--
Crash-utility mailing list
Crash-utility@redhat.com
https://www.redhat.com/mailman/listinfo/crash-utility