On 13.02.2017 01:38, Borislav Petkov wrote:
On Sun, Feb 12, 2017 at 11:21:13PM +0100, Gabriel C wrote:
http://ftp.frugalware.org/pub/other/people/crazy/kernel/t/crash_initcall_debug.mp4
http://ftp.frugalware.org/pub/other/people/crazy/kernel/t/crash_intcall_debug_ucode_off.mp4

Thanks and interesting. In both cases, mcheck_init_device() doesn't
return or we don't see the "initcall returned" message.

Ok, let's try a silly sprinkling of printks in that function and try to
pinpoint how far we manage to come.

Apply, build, boot and shoot video again :-)


I didn't tested your patch yet but did a boot with mce=off and nomce which 
seems to not
really works since is still want to mc_device_add() even when off.

See :


http://ftp.frugalware.org/pub/other/people/crazy/kernel/t/crash_mce_off.jpg

I'll build an .10-rc8 with your patch tomorrow .. is somewhat late now here :)


Another thing is .. there seems to be a real bug in tsc code .

I've build an -rc8 with a lot more debug options on an now I see the following :

...

[    4.321029] =================================
[    4.321909] [ INFO: inconsistent lock state ]
[    4.322789] 4.10.0-rc8-debug #1 Tainted: G          I
[    4.323879] ---------------------------------
[    4.324759] inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
[    4.325973] cpuhp/0/14 [HC0[0]:SC0[0]:HE1:SE1] takes:
[    4.326993]  (tick_broadcast_lock){?.....}, at: [<ffffffff810f40e7>] 
tick_broadcast_control+0x57/0x190
[    4.328879] {IN-HARDIRQ-W} state was registered at:
[    4.329866]   __lock_acquire+0x24f/0x19e0
[    4.330675]   lock_acquire+0xa5/0xd0
[    4.331399]   _raw_spin_lock_irqsave+0x54/0x90
[    4.332297]   tick_broadcast_switch_to_oneshot+0x11/0x50
[    4.333371]   tick_switch_to_oneshot+0x8c/0xd0
[    4.334269]   tick_init_highres+0x10/0x20
[    4.335079]   hrtimer_run_queues+0x5a/0xe0
[    4.335907]   run_local_timers+0x20/0x50
[    4.336699]   update_process_times+0x22/0x50
[    4.337562]   tick_periodic+0xa5/0xb0
[    4.338302]   tick_handle_periodic+0x1f/0x60
[    4.378065]   smp_trace_apic_timer_interrupt+0x74/0x90
[    4.418107]   smp_apic_timer_interrupt+0x9/0x10
[    4.458095]   apic_timer_interrupt+0x93/0xa0
[    4.498048]   mwait_idle+0x5a/0x90
[    4.537618]   arch_cpu_idle+0xa/0x10
[    4.577098]   default_idle_call+0x2c/0x30
[    4.616211]   do_idle+0x10c/0x1e0
[    4.654606]   cpu_startup_entry+0x5d/0x60
[    4.692388]   rest_init+0x12c/0x140
[    4.729557]   start_kernel+0x45f/0x46c
[    4.766325]   x86_64_start_reservations+0x2a/0x2c
[    4.803075]   x86_64_start_kernel+0xeb/0xf8
[    4.839178]   verify_cpu+0x0/0xfc
[    4.874629] irq event stamp: 71
[    4.909417] hardirqs last  enabled at (71): [<ffffffff8176a887>] 
_raw_spin_unlock_irq+0x27/0x50
[    4.945642] hardirqs last disabled at (70): [<ffffffff81762eaa>] 
__schedule+0x13a/0x7c0
[    4.981797] softirqs last  enabled at (0): [<ffffffff81060ef0>] 
copy_process+0x7c0/0x1ea0
[    5.018580] softirqs last disabled at (0): [<          (null)>]           
(null)
[    5.055677]
               other info that might help us debug this:
[    5.072455] tsc: Refined TSC clocksource calibration: 2266.746 MHz
[    5.072467] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 
0x20ac7f6ecc6, max_idle_ns: 440795315461 ns
[    5.202828]  Possible unsafe locking scenario:


^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This seems to be the place where the other patch breaks hell here..


[    5.276704]        CPU0
[    5.312400]        ----
[    5.347605]   lock(tick_broadcast_lock);
[    5.383163]   <Interrupt>
[    5.418457]     lock(tick_broadcast_lock);
[    5.454015]
                *** DEADLOCK ***

[    5.557982] no locks held by cpuhp/0/14.
[    5.592295]
               stack backtrace:
[    5.657946] CPU: 0 PID: 14 Comm: cpuhp/0 Tainted: G          I     
4.10.0-rc8-debug #1
[    5.690740] Hardware name: FUJITSU                          PRIMERGY TX200 
S5             /D2709, BIOS 6.00 Rev. 1.14.2709              02/04/2013
[    5.758323] Call Trace:
[    5.791434]  dump_stack+0x86/0xc1
[    5.824421]  print_usage_bug+0x283/0x2a0
[    5.857357]  mark_lock+0x39e/0x650
[    5.890256]  ? check_usage_forwards+0xf0/0xf0
[    5.923436]  __lock_acquire+0x2ba/0x19e0
[    5.956617]  ? pick_next_task_fair+0x350/0x700
[    5.989903]  ? finish_task_switch+0x184/0x220
[    6.023171]  ? debug_smp_processor_id+0x17/0x20
[    6.056667]  lock_acquire+0xa5/0xd0
[    6.089882]  ? tick_broadcast_control+0x57/0x190
[    6.123395]  ? smpboot_thread_fn+0x28/0x250
[    6.156838]  _raw_spin_lock+0x3c/0x80
[    6.190175]  ? tick_broadcast_control+0x57/0x190
[    6.223914]  tick_broadcast_control+0x57/0x190
[    6.257846]  ? finish_task_switch+0x184/0x220
[    6.291900]  ? smpboot_thread_fn+0x28/0x250
[    6.325991]  intel_idle_cpu_online+0x1d/0x100
[    6.360220]  cpuhp_invoke_callback+0x62/0x120
[    6.394397]  ? smpboot_thread_fn+0x28/0x250
[    6.428451]  cpuhp_thread_fun+0x87/0x110
[    6.462611]  smpboot_thread_fn+0x227/0x250
[    6.496805]  kthread+0x125/0x130
[    6.530947]  ? sort_range+0x20/0x20
[    6.564126]  ? kthread_stop+0x70/0x70
[    6.597078]  ret_from_fork+0x31/0x40
[    6.630787] clocksource: Switched to clocksource tsc

...


Maybe Thomas has some ideas ?


Regards,

Gabriel C

Reply via email to