* [email protected] ([email protected]) wrote:
> Thanks Mathieu. I'm going to have to look at this a bit more. This may
> have fixed one problem and cause another as "lttctl -C -w /tmp/trace
> trace1" does not return and I'm getting a stack dump in the messages
> file (see below). The last msg I see from that "lttctl -C" command is
> "lttctl: Creating trace". The "lttctl: Forking lttd" is not displayed.

If you disable RCU stalls detection, does it work ?

[Context for Paul: it's on a -RT kernel, with the LTTng patchset]

Thanks,

Mathieu

> 
> JP
> 
> May 11 10:05:00 ans-ebb3 kernel:  P5
> May 11 10:05:00 ans-ebb3 kernel: INFO: RCU detected CPU stalls: P5 (detected 
> by 0, t=10002 jiffies)
> May 11 10:05:00 ans-ebb3 kernel: sending NMI to all CPUs:
> May 11 10:05:00 ans-ebb3 kernel: NMI backtrace for cpu 1
> May 11 10:05:00 ans-ebb3 kernel: Modules linked in: battery sbshc ehci_hcd 
> sbs scsi_dh ohci_hcd dm_multipath uhci_hcd scsi_mod dm_mirror sd_mod libata 
> ahci dm_mod dm_log dm_region_hash tpm_bios tpm intel_agp iTCO_vendor_support 
> i2c_core pcspkr tpm_tis iTCO_wdt i2c_i801 autofs4 e1000e button serio_raw sg 
> output video ac parport lp parport_pc ipv6 nvidia(P) [last unloaded: nvidia]
> May 11 10:05:00 ans-ebb3 kernel: 
> May 11 10:05:00 ans-ebb3 kernel: Pid: 0, comm: swapper Tainted: P            
> 2.6.33.2-rt13.11.trace #10 PXT    /To Be Filled By O.E.M.
> May 11 10:05:00 ans-ebb3 kernel: EIP: 0060:[<c062e4fa>] EFLAGS: 00000246 CPU: 
> 1
> May 11 10:05:00 ans-ebb3 kernel: EIP is at poll_idle+0x1e/0x5c
> May 11 10:05:00 ans-ebb3 kernel: EAX: f74bc000 EBX: 6b655cfc ECX: 6b655cfc 
> EDX: 000001a2
> May 11 10:05:00 ans-ebb3 kernel: ESI: 000001a2 EDI: 00000000 EBP: f74bcf84 
> ESP: f74bcf7c
> May 11 10:05:00 ans-ebb3 kernel:  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 
> 0068 preempt:04010002
> May 11 10:05:00 ans-ebb3 kernel: Process swapper (pid: 0, ti=f74bc000 
> task=f74d90c0 task.ti=f74bc000)
> May 11 10:05:00 ans-ebb3 kernel: Stack:
> May 11 10:05:00 ans-ebb3 kernel:  f765b02c f765b01c f74bcf94 c062e7c9 
> 00000001 c0877c90 f74bcfa4 c0401c0d
> May 11 10:05:00 ans-ebb3 kernel: <0> 00000001 00000000 f74bcfb0 c06ca177 
> 01020800 00000000 00000000 00000000
> May 11 10:05:00 ans-ebb3 kernel: <0> 00000000 00000000 00000000 00000000 
> 00000000 00000000 00000000 000000d8
> May 11 10:05:00 ans-ebb3 kernel: Call Trace:
> May 11 10:05:00 ans-ebb3 kernel:  [<c062e7c9>] ? cpuidle_idle_call+0x6c/0xdf
> May 11 10:05:00 ans-ebb3 kernel:  [<c0401c0d>] ? cpu_idle+0x49/0x80
> May 11 10:05:00 ans-ebb3 kernel:  [<c06ca177>] ? start_secondary+0x1b9/0x1be
> May 11 10:05:00 ans-ebb3 kernel: Code: ff ff ff 8d 65 f4 89 d8 5b 5e 5f 5d c3 
> 55 89 e5 56 53 e8 7b 61 e2 ff 89 c3 89 d6 fb 89 e0 25 00 f0 ff ff eb 02 f3 90 
> f6 40 08 08 <74> f8 e8 60 61 e2 ff 29 d8 19 f2 e8 95 b6 e0 ff b9 40 42 0f 00 
> May 11 10:05:00 ans-ebb3 kernel: Call Trace:
> May 11 10:05:00 ans-ebb3 kernel:  [<c062e7c9>] cpuidle_idle_call+0x6c/0xdf
> May 11 10:05:00 ans-ebb3 kernel:  [<c0401c0d>] cpu_idle+0x49/0x80
> May 11 10:05:00 ans-ebb3 kernel:  [<c06ca177>] start_secondary+0x1b9/0x1be
> May 11 10:05:00 ans-ebb3 kernel: Pid: 0, comm: swapper Tainted: P            
> 2.6.33.2-rt13.11.trace #10
> May 11 10:05:00 ans-ebb3 kernel: Call Trace:
> May 11 10:05:00 ans-ebb3 kernel:  [<c04019d4>] ? show_regs+0x1a/0x20
> May 11 10:05:00 ans-ebb3 kernel:  [<c06d0d7a>] nmi_watchdog_tick+0x9b/0x162
> May 11 10:05:00 ans-ebb3 kernel:  [<c06d02e1>] do_nmi+0xd1/0x2e1
> May 11 10:05:00 ans-ebb3 kernel:  [<c056355d>] ? 
> ltt_chan_for_each_channel+0x35/0x4f
> May 11 10:05:00 ans-ebb3 kernel:  [<c05628df>] ? 
> ltt_chanbuf_idle_switch+0x0/0x1c
> May 11 10:05:00 ans-ebb3 kernel:  [<c06cfd41>] nmi_stack_correct+0x28/0x2d
> May 11 10:05:00 ans-ebb3 kernel:  [<c062e4fa>] ? poll_idle+0x1e/0x5c
> May 11 10:05:00 ans-ebb3 kernel:  [<c062e7c9>] cpuidle_idle_call+0x6c/0xdf
> May 11 10:05:00 ans-ebb3 kernel:  [<c0401c0d>] cpu_idle+0x49/0x80
> May 11 10:05:00 ans-ebb3 kernel:  [<c06ca177>] start_secondary+0x1b9/0x1be
> May 11 10:05:00 ans-ebb3 kernel: NMI backtrace for cpu 0
> May 11 10:05:00 ans-ebb3 kernel: Modules linked in: battery sbshc ehci_hcd 
> sbs scsi_dh ohci_hcd dm_multipath uhci_hcd scsi_mod dm_mirror sd_mod libata 
> ahci dm_mod dm_log dm_region_hash tpm_bios tpm intel_agp iTCO_vendor_support 
> i2c_core pcspkr tpm_tis iTCO_wdt i2c_i801 autofs4 e1000e button serio_raw sg 
> output video ac parport lp parport_pc ipv6 nvidia(P) [last unloaded: nvidia]
> May 11 10:05:00 ans-ebb3 kernel: 
> May 11 10:05:00 ans-ebb3 kernel: Pid: 0, comm: swapper Tainted: P            
> 2.6.33.2-rt13.11.trace #10 PXT    /To Be Filled By O.E.M.
> May 11 10:05:00 ans-ebb3 kernel: EIP: 0060:[<c0570ab6>] EFLAGS: 00000807 CPU: > 0
> May 11 10:05:00 ans-ebb3 kernel: EIP is at delay_tsc+0x0/0x8a
> May 11 10:05:00 ans-ebb3 kernel: EAX: 001c7b29 EBX: 00000001 ECX: 04f29000 
> EDX: 001c7b28
> May 11 10:05:00 ans-ebb3 kernel: ESI: 00000000 EDI: c084a6c0 EBP: c0828e6c 
> ESP: c0828e68
> May 11 10:05:00 ans-ebb3 kernel:  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 
> 0068 preempt:04020002
> May 11 10:05:00 ans-ebb3 kernel: Process swapper (pid: 0, ti=c0828000 
> task=c08407a0 task.ti=c0828000)
> May 11 10:05:00 ans-ebb3 kernel: Stack:
> May 11 10:05:00 ans-ebb3 kernel:  c0570a4f c0828e78 c0418e84 c084a740 
> c0828eb0 c047c35b c07c1944 00000000
> May 11 10:05:00 ans-ebb3 kernel: <0> 00002712 c5805268 c084a6c0 c084a6ec 
> 00000003 c084a740 00000083 00000000
> May 11 10:05:00 ans-ebb3 kernel: <0> 00000000 c08407a0 c0828ec0 c047c45f 
> 00000000 00000000 c0828ed4 c0441fbd
> May 11 10:05:00 ans-ebb3 kernel: Call Trace:
> May 11 10:05:00 ans-ebb3 kernel:  [<c0570a4f>] ? __const_udelay+0x2f/0x31
> May 11 10:05:01 ans-ebb3 kernel:  [<c0418e84>] ? 
> arch_trigger_all_cpu_backtrace+0x42/0x4f
> May 11 10:05:01 ans-ebb3 kernel:  [<c047c35b>] ? __rcu_pending+0x258/0x309
> May 11 10:05:01 ans-ebb3 kernel:  [<c047c45f>] ? 
> rcu_check_callbacks+0x53/0x114
> May 11 10:05:01 ans-ebb3 kernel:  [<c0441fbd>] ? 
> update_process_times+0x32/0x43
> May 11 10:05:01 ans-ebb3 kernel:  [<c04587b7>] ? tick_sched_timer+0x6d/0x9b
> May 11 10:05:01 ans-ebb3 kernel:  [<c045874a>] ? tick_sched_timer+0x0/0x9b
> May 11 10:05:01 ans-ebb3 kernel:  [<c044f8fc>] ? __run_hrtimer+0xeb/0x154
> May 11 10:05:01 ans-ebb3 kernel:  [<c044fc4d>] ? hrtimer_interrupt+0xed/0x21e
> May 11 10:05:01 ans-ebb3 kernel:  [<c04518c5>] ? 
> sched_clock_idle_wakeup_event+0x16/0x18
> May 11 10:05:01 ans-ebb3 kernel:  [<c0418512>] ? 
> smp_apic_timer_interrupt+0xb4/0x106
> May 11 10:05:01 ans-ebb3 kernel:  [<c06cfa66>] ? 
> apic_timer_interrupt+0x2a/0x30
> May 11 10:05:01 ans-ebb3 kernel:  [<c062e4f6>] ? poll_idle+0x1a/0x5c
> May 11 10:05:01 ans-ebb3 kernel:  [<c062e7c9>] ? cpuidle_idle_call+0x6c/0xdf
> May 11 10:05:01 ans-ebb3 kernel:  [<c0401c0d>] ? cpu_idle+0x49/0x80
> May 11 10:05:01 ans-ebb3 kernel:  [<c06bdde4>] ? rest_init+0x80/0x82
> May 11 10:05:01 ans-ebb3 kernel:  [<c087a859>] ? start_kernel+0x2fb/0x300
> May 11 10:05:01 ans-ebb3 kernel:  [<c087a08c>] ? i386_start_kernel+0x8c/0x91
> May 11 10:05:01 ans-ebb3 kernel: Code: c0 8b 0c 8d 18 7b 87 c0 6b c0 14 ba 40 
> ff 8d c0 55 89 e5 69 94 0a 98 00 00 00 fa 00 00 00 f7 e2 8d 42 01 ff 15 90 84 
> 85 c0 5d c3 <55> 89 e5 57 56 89 c6 53 83 ec 04 89 e0 25 00 f0 ff ff ff 40 14 
> May 11 10:05:01 ans-ebb3 kernel: Call Trace:
> May 11 10:05:01 ans-ebb3 kernel:  [<c0570a4f>] ? __const_udelay+0x2f/0x31
> May 11 10:05:01 ans-ebb3 kernel:  [<c0418e84>] 
> arch_trigger_all_cpu_backtrace+0x42/0x4f
> May 11 10:05:01 ans-ebb3 kernel:  [<c047c35b>] __rcu_pending+0x258/0x309
> May 11 10:05:01 ans-ebb3 kernel:  [<c047c45f>] rcu_check_callbacks+0x53/0x114
> May 11 10:05:01 ans-ebb3 kernel:  [<c0441fbd>] update_process_times+0x32/0x43
> May 11 10:05:01 ans-ebb3 kernel:  [<c04587b7>] tick_sched_timer+0x6d/0x9b
> May 11 10:05:01 ans-ebb3 kernel:  [<c045874a>] ? tick_sched_timer+0x0/0x9b
> May 11 10:05:01 ans-ebb3 kernel:  [<c044f8fc>] __run_hrtimer+0xeb/0x154
> May 11 10:05:01 ans-ebb3 kernel:  [<c044fc4d>] hrtimer_interrupt+0xed/0x21e
> May 11 10:05:01 ans-ebb3 kernel:  [<c04518c5>] ? 
> sched_clock_idle_wakeup_event+0x16/0x18
> May 11 10:05:01 ans-ebb3 kernel:  [<c0418512>] 
> smp_apic_timer_interrupt+0xb4/0x106
> May 11 10:05:01 ans-ebb3 kernel:  [<c06cfa66>] apic_timer_interrupt+0x2a/0x30
> May 11 10:05:01 ans-ebb3 kernel:  [<c062e4f6>] ? poll_idle+0x1a/0x5c
> May 11 10:05:01 ans-ebb3 kernel:  [<c062e7c9>] cpuidle_idle_call+0x6c/0xdf
> May 11 10:05:01 ans-ebb3 kernel:  [<c0401c0d>] cpu_idle+0x49/0x80
> May 11 10:05:01 ans-ebb3 kernel:  [<c06bdde4>] rest_init+0x80/0x82
> May 11 10:05:01 ans-ebb3 kernel:  [<c087a859>] start_kernel+0x2fb/0x300
> May 11 10:05:01 ans-ebb3 kernel:  [<c087a08c>] i386_start_kernel+0x8c/0x91
> May 11 10:05:01 ans-ebb3 kernel: Pid: 0, comm: swapper Tainted: P            
> 2.6.33.2-rt13.11.trace #10
> May 11 10:05:01 ans-ebb3 kernel: Call Trace:
> May 11 10:05:01 ans-ebb3 kernel:  [<c04019d4>] ? show_regs+0x1a/0x20
> May 11 10:05:01 ans-ebb3 kernel:  [<c06d0d7a>] nmi_watchdog_tick+0x9b/0x162
> May 11 10:05:01 ans-ebb3 kernel:  [<c06d02e1>] do_nmi+0xd1/0x2e1
> May 11 10:05:01 ans-ebb3 kernel:  [<c05645a8>] ? ltt_vtrace+0x0/0x79b
> May 11 10:05:02 ans-ebb3 kernel:  [<c06cfd41>] nmi_stack_correct+0x28/0x2d
> May 11 10:05:02 ans-ebb3 kernel:  [<c0570ab6>] ? delay_tsc+0x0/0x8a
> May 11 10:05:02 ans-ebb3 kernel:  [<c0570a4f>] ? __const_udelay+0x2f/0x31
> May 11 10:05:02 ans-ebb3 kernel:  [<c0418e84>] 
> arch_trigger_all_cpu_backtrace+0x42/0x4f
> May 11 10:05:02 ans-ebb3 kernel:  [<c047c35b>] __rcu_pending+0x258/0x309
> May 11 10:05:02 ans-ebb3 kernel:  [<c047c45f>] rcu_check_callbacks+0x53/0x114
> May 11 10:05:02 ans-ebb3 kernel:  [<c0441fbd>] update_process_times+0x32/0x43
> May 11 10:05:02 ans-ebb3 kernel:  [<c04587b7>] tick_sched_timer+0x6d/0x9b
> May 11 10:05:02 ans-ebb3 kernel:  [<c045874a>] ? tick_sched_timer+0x0/0x9b
> May 11 10:05:02 ans-ebb3 kernel:  [<c044f8fc>] __run_hrtimer+0xeb/0x154
> May 11 10:05:02 ans-ebb3 kernel:  [<c044fc4d>] hrtimer_interrupt+0xed/0x21e
> May 11 10:05:02 ans-ebb3 kernel:  [<c04518c5>] ? 
> sched_clock_idle_wakeup_event+0x16/0x18
> May 11 10:05:02 ans-ebb3 kernel:  [<c0418512>] 
> smp_apic_timer_interrupt+0xb4/0x106
> May 11 10:05:02 ans-ebb3 kernel:  [<c06cfa66>] apic_timer_interrupt+0x2a/0x30
> May 11 10:05:02 ans-ebb3 kernel:  [<c062e4f6>] ? poll_idle+0x1a/0x5c
> May 11 10:05:02 ans-ebb3 kernel:  [<c062e7c9>] cpuidle_idle_call+0x6c/0xdf
> May 11 10:05:02 ans-ebb3 kernel:  [<c0401c0d>] cpu_idle+0x49/0x80
> May 11 10:05:02 ans-ebb3 kernel:  [<c06bdde4>] rest_init+0x80/0x82
> May 11 10:05:02 ans-ebb3 kernel:  [<c087a859>] start_kernel+0x2fb/0x300
> May 11 10:05:02 ans-ebb3 kernel:  [<c087a08c>] i386_start_kernel+0x8c/0x91
> 
> 
> 
> -----Original Message-----
> From: Mathieu Desnoyers [mailto:[email protected]]
> Sent: Tue 5/11/2010 7:59 AM
> To: John P. Paul
> Cc: [email protected]
> Subject: Re: [ltt-dev] lttctl locks up with RT Linux
>  
> * [email protected] ([email protected]) wrote:
> > Hey Mathieu:
> > 
> > Thanks for looking at this. I'm a bit new to debugging at this level, so
> > you may need to provide me a bit more info on what you need. I attempted
> > to use "pstack" on the lttctl and lttd tasks ... no luck as pstack also
> > locked up.
> > 
> > I put a bit of tracing into liblttctl and discovered the lockup occurs
> > when a write of "traceName" (whatever traceName happens to be) occurs to
> > the "/mnt/debugfs/ltt/destroy_trace" file.
> > 
> > I'm guessing that you would like some tracing of the ltt kernel module.
> > Is there something that I can turn on, or another way I could get a
> > stack dump of that module after lockup?  I'll do a little research this
> > weekend on kernel debugging techniques.
> > 
> > I can certainly sprinkle in some printk statements in the ltt kernel
> > module source. Doing provided the following info:
> > 
> > - Control entered _ltt_trace_destroy (single underscore)
> > - Control entered del_timer_sync(&ltt_async_wakeup_timer) and never
> > returned
> > 
> > Does that help, or should I continue farther down this path?
> 
> Can you try the following patch to see if it fixes your problem ?
> 
> 
> lttng fix rt kernel teardown deadlock
> 
> LTTng has a teardown bug on RT (deadlock):
> 
> Deleting a timer (sync) while holding the traces mutex, and the handler takes
> the same mutex, which leads to a deadlock.
> 
> Fix this by taking a RCU read lock in the timer (instead of the RT-specific 
> fix
> using the mutex), and by doing synchronize_rcu() in addition to
> synchronize_sched() upon updates.
> 
> Signed-off-by: Mathieu Desnoyers <[email protected]>
> ---
>  ltt/ltt-tracer.c |   28 ++++++++++++++++++----------
>  1 file changed, 18 insertions(+), 10 deletions(-)
> 
> Index: linux-2.6-lttng/ltt/ltt-tracer.c
> ===================================================================
> --- linux-2.6-lttng.orig/ltt/ltt-tracer.c     2010-05-11 07:50:46.000000000 
> -0400
> +++ linux-2.6-lttng/ltt/ltt-tracer.c  2010-05-11 07:55:46.000000000 -0400
> @@ -41,6 +41,14 @@
>  #include <linux/vmalloc.h>
>  #include <asm/atomic.h>
>  
> +static void synchronize_trace(void)
> +{
> +     synchronize_sched();
> +#ifdef CONFIG_PREEMPT_RT
> +     synchronize_rcu();
> +#endif
> +}
> +
>  static void async_wakeup(unsigned long data);
>  
>  static DEFINE_TIMER(ltt_async_wakeup_timer, async_wakeup, 0, 0);
> @@ -321,7 +329,7 @@ void ltt_module_unregister(enum ltt_modu
>               ltt_filter_unregister();
>               ltt_run_filter_owner = NULL;
>               /* Wait for preempt sections to finish */
> -             synchronize_sched();
> +             synchronize_trace();
>               break;
>       case LTT_FUNCTION_FILTER_CONTROL:
>               ltt_filter_control_functor = ltt_filter_control_default;
> @@ -429,13 +437,13 @@ static void async_wakeup(unsigned long d
>        * PREEMPT_RT does not allow spinlocks to be taken within preempt
>        * disable sections (spinlock taken in wake_up). However, mainline won't
>        * allow mutex to be taken in interrupt context. Ugly.
> -      * A proper way to do this would be to turn the timer into a
> -      * periodically woken up thread, but it adds to the footprint.
> +      * Take a standard RCU read lock for RT kernels, which imply that we
> +      * also have to synchronize_rcu() upon updates.
>        */
>  #ifndef CONFIG_PREEMPT_RT
>       rcu_read_lock_sched();
>  #else
> -     ltt_lock_traces();
> +     rcu_read_lock();
>  #endif
>       list_for_each_entry_rcu(trace, &ltt_traces.head, list) {
>               trace_async_wakeup(trace);
> @@ -443,7 +451,7 @@ static void async_wakeup(unsigned long d
>  #ifndef CONFIG_PREEMPT_RT
>       rcu_read_unlock_sched();
>  #else
> -     ltt_unlock_traces();
> +     rcu_read_unlock();
>  #endif
>  
>       mod_timer(&ltt_async_wakeup_timer, jiffies + LTT_PERCPU_TIMER_INTERVAL);
> @@ -901,7 +909,7 @@ int ltt_trace_alloc(const char *trace_na
>               set_kernel_trace_flag_all_tasks();
>       }
>       list_add_rcu(&trace->list, &ltt_traces.head);
> -     synchronize_sched();
> +     synchronize_trace();
>  
>       ltt_unlock_traces();
>  
> @@ -974,7 +982,7 @@ static int _ltt_trace_destroy(struct ltt
>       }
>       /* Everything went fine */
>       list_del_rcu(&trace->list);
> -     synchronize_sched();
> +     synchronize_trace();
>       if (list_empty(&ltt_traces.head)) {
>               clear_kernel_trace_flag_all_tasks();
>               /*
> @@ -1195,7 +1203,7 @@ static int _ltt_trace_stop(struct ltt_tr
>                       trace->nr_channels);
>               trace->active = 0;
>               ltt_traces.num_active_traces--;
> -             synchronize_sched(); /* Wait for each tracing to be finished */
> +             synchronize_trace(); /* Wait for each tracing to be finished */
>       }
>       module_put(ltt_run_filter_owner);
>       /* Everything went fine */
> @@ -1327,12 +1335,12 @@ static void __exit ltt_exit(void)
>       list_for_each_entry_rcu(trace, &ltt_traces.head, list)
>               _ltt_trace_stop(trace);
>       /* Wait for quiescent state. Readers have preemption disabled. */
> -     synchronize_sched();
> +     synchronize_trace();
>       /* Safe iteration is now permitted. It does not have to be RCU-safe
>        * because no readers are left. */
>       list_for_each_safe(pos, n, &ltt_traces.head) {
>               trace = container_of(pos, struct ltt_trace, list);
> -             /* _ltt_trace_destroy does a synchronize_sched() */
> +             /* _ltt_trace_destroy does a synchronize_trace() */
>               _ltt_trace_destroy(trace);
>               __ltt_trace_destroy(trace);
>       }
> 
> 
> 
> --
> This is an e-mail from General Dynamics Robotic Systems. It is for the 
> intended recipient only and may contain confidential and privileged 
> information. No one else may read, print, store, copy, forward or act in 
> reliance on it or its attachments. If you are not the intended recipient, 
> please return this message to the sender and delete the message and any 
> attachments from your computer. Your cooperation is appreciated.
> 

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

_______________________________________________
ltt-dev mailing list
[email protected]
http://lists.casi.polymtl.ca/cgi-bin/mailman/listinfo/ltt-dev

Reply via email to