Hello, perf record -g dwarf (and perf report) doesn't show correct callchain on aarch64. Here is how to reproduce it.
1) I've prepared an debian8 aarch64 VM on qemu-system-aarch64, and build/install the latest perftools on it. 2) Build attached program as below # gcc -O0 -ggdb3 -funwind-tables -o main main.c 3) Run perf to record with dwarf. # perf record -g --call-graph dwarf,1024 -e cpu-clock:u -o /tmp/perf.data -- ./main ^C[ perf record: Woken up 35 times to write data ] [ perf record: Captured and wrote 8.526 MB /tmp/perf.data (6495 samples) ] 4) Run perf to report the result. # perf report -i /tmp/perf.data # To display the perf.data header info, please use --header/--header-only option # # # Total Lost Samples: 0 # # Samples: 6K of event 'cpu-clock:u' # Event count (approx.): 1623750000 # # Children Self Command Shared Object Symbol # ........ ........ ....... ............. .......................... # 17.21% 17.21% main main [.] func2 | ---func2 17.09% 17.09% main main [.] func1 | ---func1 16.67% 16.67% main main [.] main | ---main ..... So, as you can see, the call graph reported each function has been called from itself. If I report it with fp as below, perf reported correct callgraph. 3') record it with fp # perf record -g --call-graph fp -e cpu-clock:u -o /tmp/perf.data -- ./main ^C[ perf record: Woken up 2 times to write data ] [ perf record: Captured and wrote 0.397 MB /tmp/perf.data (4160 samples) ] 4') report it # perf report -i /tmp/perf.data # To display the perf.data header info, please use --header/--header-only option # # # Total Lost Samples: 0 # # Samples: 4K of event 'cpu-clock:u' # Event count (approx.): 1040000000 # # Children Self Command Shared Object Symbol # ........ ........ ....... ............. ........................... # 99.06% 0.00% main libc-2.19.so [.] __libc_start_main | ---__libc_start_main | --98.94%--main | |--80.24%--func0 | | | |--63.27%--func1 | | | | | |--47.04%--func2 | | | | ..... I tried to dump but it seems correct. # perf report -D -i /tmp/perf.data [...] . 0130: c8 05 40 00 00 00 00 00 30 e1 a8 df ff ff 00 00 ..@.....0....... . 0140: 90 05 40 00 00 00 00 00 [00 04 00 00 00 00 00 00](*) ..@............. . 0150: 50 e1 a8 df ff ff 00 00 e8 05 40 00 00 00 00 00 P.........@..... . 0160: 00 00 00 00 00 00 00 00 00 00 00 00 08 00 00 00 ................ . 0170: 70 e1 a8 df ff ff 00 00 08 06 40 00 00 00 00 00 p.........@..... . 0180: 8c b6 c9 8f ff ff 00 00 00 00 00 00 04 00 00 00 ................ . 0190: 90 e1 a8 df ff ff 00 00 28 06 40 00 00 00 00 00 ........(.@..... . 01a0: c0 e1 a8 df ff ff 00 00 2c 1d b6 8f 02 00 00 00 ........,....... . 01b0: b0 e1 a8 df ff ff 00 00 40 06 40 00 00 00 00 00 ........@.@..... . 01c0: c0 e1 a8 df ff ff 00 00 48 1d b6 8f 01 00 00 00 ........H...... [...] 1207680984048 0xf040 [0x560]: PERF_RECORD_SAMPLE(IP, 0x2): 114/114: 0x400590 per ... FP chain: nr:0 ... user regs: mask 0x1ffffffff ABI 64-bit [...] .... x29 0xffffdfa8e130 .... lr 0x4005c8 .... sp 0xffffdfa8e130 .... pc 0x400590 ... ustack: size 1024, offset 0x148 . data_src: 0x5080021 ... thread: main:114 In this entry, ustack should start from offset=0x0148 in event raw data, I marked it with (*), which is the saved stack size(0x400 = 1024) and the top of stack has 0xffffdfa8e150 which seems next frame pointer, and after that there is 0x4005e8, which is next return address. 00000000004005b0 <func3>: 4005b0: a9be7bfd stp x29, x30, [sp,#-32]! 4005b4: 910003fd mov x29, sp 4005b8: b9001fa0 str w0, [x29,#28] 4005bc: b9401fa0 ldr w0, [x29,#28] 4005c0: 11002000 add w0, w0, #0x8 4005c4: 97fffff3 bl 400590 <func4> 4005c8: a8c27bfd ldp x29, x30, [sp],#32 4005cc: d65f03c0 ret 00000000004005d0 <func2>: 4005d0: a9be7bfd stp x29, x30, [sp,#-32]! 4005d4: 910003fd mov x29, sp 4005d8: b9001fa0 str w0, [x29,#28] 4005dc: b9401fa0 ldr w0, [x29,#28] 4005e0: 11001000 add w0, w0, #0x4 4005e4: 97fffff3 bl 4005b0 <func3> 4005e8: a8c27bfd ldp x29, x30, [sp],#32 So, the stack data should be correct. I guess there is a bug in libunwind on aarch64 or we missed to pass the stack data to libunwind. (BTW, it works correctly on arm32) Thank you, -- Masami Hiramatsu <mhira...@kernel.org>
int buf = 0; void func(int a) { buf += a; } void func4(int a) { func (a+16); } void func3(int a) { func4(a+8); } void func2(int a) { func3(a+4); } void func1(int a) { func2(a+2); } void func0(int a) { func1(a+1); } void main() { do { func0(1); } while (buf); }