Hi folks, I've been trying to track down an issue that started appearing a while back:
[ 73.078526] ------------[ cut here ]------------ [ 73.083194] WARNING: CPU: 2 PID: 4002 at kernel/trace/trace_functions_graph.c:991 print_graph_entry+0x579/0x590 [ 73.093544] Modules linked in: x86_pkg_temp_thermal fuse [ 73.098939] CPU: 2 UID: 0 PID: 4002 Comm: cat Tainted: G S 6.16.0 #1 PREEMPT(voluntary) [ 73.108587] Tainted: [S]=CPU_OUT_OF_SPEC [ 73.112664] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS 2.7 12/07/2021 [ 73.120126] RIP: 0010:print_graph_entry+0x579/0x590 [ 73.125198] Code: 49 89 40 20 49 8b 46 08 49 89 40 28 49 8b 46 10 49 89 40 30 49 8b 46 18 49 89 40 38 49 8b 46 20 49 89 40 40 e9 27 fe ff ff 90 <0f> 0b 90 e9 e2 fe ff ff 90 0f 0b 90 e9 8e fc ff ff e8 91 d8 10 01 [ 73.144001] RSP: 0018:ffffa6af02d37c58 EFLAGS: 00010282 [ 73.149369] RAX: ffffc6aeffd986f0 RBX: ffff9d70c83b0000 RCX: 00000000fefefefe [ 73.156621] RDX: ffffffffbb374080 RSI: 0000000000000001 RDI: ffffffffbaf773ea [ 73.163839] RBP: ffffa6af02d37cf0 R08: ffff9d70c1790cc0 R09: 0000000000000020 [ 73.171023] R10: 0000000000000000 R11: 0000000000000004 R12: ffff9d70c83b2090 [ 73.178216] R13: 0000000000000003 R14: ffff9d70c83b0000 R15: ffff9d70c1790cc0 [ 73.185412] FS: 00007fd8c6872740(0000) GS:ffff9d72741a9000(0000) knlGS:0000000000000000 [ 73.193584] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 73.199391] CR2: 00007ffedaf50fbc CR3: 00000001049f4006 CR4: 00000000003726f0 [ 73.206570] Call Trace: [ 73.209086] <TASK> [ 73.211313] ? trace_event_raw_event_preemptirq_template+0x66/0xc0 [ 73.217573] ? __pfx_put_cpu_partial+0x10/0x10 [ 73.222093] ? __pfx_put_cpu_partial+0x10/0x10 [ 73.226635] print_graph_function_flags+0x27c/0x530 [ 73.231607] ? peek_next_entry+0x9d/0xb0 [ 73.235618] print_graph_function+0x13/0x20 [ 73.239895] print_trace_line+0xbb/0x530 [ 73.243909] tracing_read_pipe+0x1d6/0x380 [ 73.248121] vfs_read+0xbb/0x380 [ 73.251495] ? vfs_read+0x9/0x380 [ 73.254929] ksys_read+0x7b/0xf0 [ 73.258258] __x64_sys_read+0x1d/0x30 [ 73.261995] x64_sys_call+0x1ada/0x20d0 [ 73.265936] do_syscall_64+0xb2/0x2b0 [ 73.269694] entry_SYSCALL_64_after_hwframe+0x77/0x7f [ 73.274818] RIP: 0033:0x7fd8c6904687 [ 73.278491] Code: 48 89 fa 4c 89 df e8 58 b3 00 00 8b 93 08 03 00 00 59 5e 48 83 f8 fc 74 1a 5b c3 0f 1f 84 00 00 00 00 00 48 8b 44 24 10 0f 05 <5b> c3 0f 1f 80 00 00 00 00 83 e2 39 83 fa 08 75 de e8 23 ff ff ff [ 73.297320] RSP: 002b:00007ffe8bb08e60 EFLAGS: 00000202 ORIG_RAX: 0000000000000000 [ 73.304974] RAX: ffffffffffffffda RBX: 00007fd8c6872740 RCX: 00007fd8c6904687 [ 73.312185] RDX: 0000000000040000 RSI: 00007fd8c6831000 RDI: 0000000000000003 [ 73.319369] RBP: 0000000000040000 R08: 0000000000000000 R09: 0000000000000000 [ 73.326588] R10: 0000000000000000 R11: 0000000000000202 R12: 00007fd8c6831000 [ 73.333801] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000040000 [ 73.341050] </TASK> [ 73.343324] ---[ end trace 0000000000000000 ]--- [ 73.804718] ------------[ cut here ]------------ [ 73.809372] WARNING: CPU: 1 PID: 4002 at kernel/trace/trace_functions_graph.c:933 print_graph_entry+0x582/0x590 [ 73.819492] Modules linked in: x86_pkg_temp_thermal fuse [ 73.824888] CPU: 1 UID: 0 PID: 4002 Comm: cat Tainted: G S W 6.16.0 #1 PREEMPT(voluntary) [ 73.834477] Tainted: [S]=CPU_OUT_OF_SPEC, [W]=WARN [ 73.839314] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS 2.7 12/07/2021 [ 73.846739] RIP: 0010:print_graph_entry+0x582/0x590 [ 73.851662] Code: 89 40 28 49 8b 46 10 49 89 40 30 49 8b 46 18 49 89 40 38 49 8b 46 20 49 89 40 40 e9 27 fe ff ff 90 0f 0b 90 e9 e2 fe ff ff 90 <0f> 0b 90 e9 8e fc ff ff e8 91 d8 10 01 90 90 90 90 90 90 90 90 90 [ 73.870458] RSP: 0018:ffffa6af02d37c58 EFLAGS: 00010282 [ 73.875733] RAX: ffffc6aeffd186f0 RBX: ffff9d70c83b0000 RCX: 00000011792c5f40 [ 73.882906] RDX: ffffffffbb374080 RSI: 00000000fefefefd RDI: 00000011792c5ff6 [ 73.890079] RBP: ffffa6af02d37cf0 R08: ffff9d70c1790cc0 R09: 0000000000000020 [ 73.897249] R10: 00000000fefefefe R11: 0000000000000004 R12: ffff9d70c83b2090 [ 73.904424] R13: 0000000000000001 R14: ffff9d70c1790ce0 R15: ffff9d70c1790cc0 [ 73.911609] FS: 00007fd8c6872740(0000) GS:ffff9d7274129000(0000) knlGS:0000000000000000 [ 73.919740] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 73.925528] CR2: 0000559a316ca3c0 CR3: 00000001049f4003 CR4: 00000000003726f0 [ 73.932705] Call Trace: [ 73.935200] <TASK> [ 73.937350] ? __legitimize_mnt+0x4/0xb0 [ 73.941342] print_graph_function_flags+0x27c/0x530 [ 73.946269] ? trace_hardirqs_on+0x2f/0x90 [ 73.950401] ? ring_buffer_empty_cpu+0x86/0xd0 [ 73.954912] print_graph_function+0x13/0x20 [ 73.959149] print_trace_line+0xbb/0x530 [ 73.963135] tracing_read_pipe+0x1d6/0x380 [ 73.967291] vfs_read+0xbb/0x380 [ 73.970580] ? vfs_read+0x9/0x380 [ 73.973954] ksys_read+0x7b/0xf0 [ 73.977244] __x64_sys_read+0x1d/0x30 [ 73.980952] x64_sys_call+0x1ada/0x20d0 [ 73.984839] do_syscall_64+0xb2/0x2b0 [ 73.988558] entry_SYSCALL_64_after_hwframe+0x77/0x7f [ 73.993657] RIP: 0033:0x7fd8c6904687 [ 73.997278] Code: 48 89 fa 4c 89 df e8 58 b3 00 00 8b 93 08 03 00 00 59 5e 48 83 f8 fc 74 1a 5b c3 0f 1f 84 00 00 00 00 00 48 8b 44 24 10 0f 05 <5b> c3 0f 1f 80 00 00 00 00 83 e2 39 83 fa 08 75 de e8 23 ff ff ff [ 74.016084] RSP: 002b:00007ffe8bb08e60 EFLAGS: 00000202 ORIG_RAX: 0000000000000000 [ 74.023743] RAX: ffffffffffffffda RBX: 00007fd8c6872740 RCX: 00007fd8c6904687 [ 74.030920] RDX: 0000000000040000 RSI: 00007fd8c6831000 RDI: 0000000000000003 [ 74.038101] RBP: 0000000000040000 R08: 0000000000000000 R09: 0000000000000000 [ 74.045277] R10: 0000000000000000 R11: 0000000000000202 R12: 00007fd8c6831000 [ 74.052484] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000040000 [ 74.059724] </TASK> [ 74.061974] ---[ end trace 0000000000000000 ]--- This patch was within the window where the issue started happening, and on inspection I found something suspicious (but couldn't verify since I'm traveling). On Thu, Feb 27, 2025 at 01:58:06PM -0500, Steven Rostedt wrote:
diff --git a/kernel/trace/trace_entries.h b/kernel/trace/trace_entries.h index fbfb396905a6..77a8ba3bc1e3 100644 --- a/kernel/trace/trace_entries.h +++ b/kernel/trace/trace_entries.h @@ -72,17 +72,18 @@ FTRACE_ENTRY_REG(function, ftrace_entry, ); /* Function call entry */ -FTRACE_ENTRY_PACKED(funcgraph_entry, ftrace_graph_ent_entry, +FTRACE_ENTRY(funcgraph_entry, ftrace_graph_ent_entry, TRACE_GRAPH_ENT, F_STRUCT( __field_struct( struct ftrace_graph_ent, graph_ent ) __field_packed( unsigned long, graph_ent, func ) - __field_packed( int, graph_ent, depth ) + __field_packed( unsigned long, graph_ent, depth ) + __dynamic_array(unsigned long, args )
So we've added a dynamically sized array to the end of ftrace_graph_ent_entry, but in struct fgraph_data, the saved entry is defined as: struct fgraph_data { ... union { struct ftrace_graph_ent_entry ent; struct fgraph_retaddr_ent_entry rent; } ent; ... } Which doesn't seem to have room for args? The code in get_return_for_leaf() does: data->ent.ent = *curr; This copies the struct, but curr points to a larger entry with args data. The copy operation only copies sizeof(struct ftrace_graph_ent_entry) bytes, which doesn't include the dynamic args array. And then later functions (like print_graph_entry()) would go ahead and assume that iter->ent_size is sane and make a mess out of everything. I can't test right now whether this actually fixes the issues or not, but I wanted to bring this up as this looks somewhat odd and I'm not too familiar with this code. -- Thanks, Sasha