On Wed, Feb 17, 2016 at 09:16:46AM +0100, Ingo Molnar wrote:
> So I'm wondering why this started triggering only now. Is this a pre-existing
> bug that somehow got triggered via:
>
>   58122bf1d856 x86/fpu: Default eagerfpu=on on all CPUs
>
> ?

Well, that's an interesting question. See, the thing is, I triggered
this only *once* by accident and I haven't seen it since. The
"reliable" "reproducer" I used to debug this was Andy's suggestion to
stick a schedule() in __fpu__restore_sig().

So the answer to that question is not easy.

BUT(!), regardless, the bug still needs to be fixed because my tracing
here

https://lkml.kernel.org/r/[email protected]

showed that getting preempted after setting

	fpu->fpstate_active = 1;

leads to the WARN.

Because - and please double-check me on that - when we're in
__switch_to() and the task which already has ->fpstate_active set is
the next task we're going to switch to, then when it enters
switch_fpu_prepare(), it does:

	fpu.preload = static_cpu_has(X86_FEATURE_FPU) &&
		      new_fpu->fpstate_active &&
		      ^^^^^^^^^^^^^^^^^^^^^^^

so that fpu.preload is set now. A bit later in that same function:

	/* Don't change CR0.TS if we just switch! */
	if (fpu.preload) {
		new_fpu->counter++;
		__fpregs_activate(new_fpu);
		^^^^^^^^^^^^^^^^^

->fpregs_active gets set here, and when the task returns to
__fpu__restore_sig(), fpu__restore() sets it again, leading to the
WARN.

Mind you, this happens on 32-bit only because there we sigreturn with
irqs enabled. Look at the call trace.

> If yes then we need a plausible theory of how that never triggered on
> modern Intel CPUs that had eagerfpu enabled for years.

AFAICT, it triggers - and the window is very small at that - only on
32-bit. If at all.

> Or perhaps was it caused by one of the other changes in tip:x86/fpu:
>
>   c6ab109f7e0e x86/fpu: Speed up lazy FPU restores slightly
>   a20d7297045f x86/fpu: Fold fpu_copy() into fpu__copy()
>   5ed73f40735c x86/fpu: Fix FNSAVE usage in eagerfpu mode
>   4ecd16ec7059 x86/fpu: Fix math emulation in eager fpu mode
>
> ?

I can certainly try to test all those but I don't have a reliable
reproducer.
The only thing I could do is check out each of those commits and stick
a schedule() in __fpu__restore_sig() and see what happens. But if my
analysis above is right, none of those would matter because of the
mechanism of how the warn happens...

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
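
For reference, the sequence described above can be modelled in plain userspace C. This is only a sketch and not kernel code: the struct and the model_*() helpers are hypothetical names made up for illustration, and it assumes the WARN is an "already active" check in the helper that sets ->fpregs_active. Only the two flag names are taken from the mail.

```c
#include <stdbool.h>

/* Hypothetical userspace model; only the two flag names come from the mail. */
struct fpu_model {
	bool fpstate_active;
	bool fpregs_active;
	int warn_count;		/* stands in for the WARN splat */
};

/* Assumed behaviour: warn if the registers are already marked active. */
static void model_fpregs_activate(struct fpu_model *fpu)
{
	if (fpu->fpregs_active)
		fpu->warn_count++;
	fpu->fpregs_active = true;
}

/* Models the preload branch taken when the task is switched back in. */
static void model_switch_in(struct fpu_model *fpu)
{
	if (fpu->fpstate_active)
		model_fpregs_activate(fpu);
}

/*
 * Models the sigreturn path. 'preempt' stands in for the debugging
 * schedule() placed right after fpstate_active is set.
 * Returns the number of warnings that fired.
 */
static int model_restore_sig(bool preempt)
{
	struct fpu_model fpu = { false, false, 0 };

	fpu.fpstate_active = true;
	if (preempt)
		model_switch_in(&fpu);	/* first activation, in the switch path */
	model_fpregs_activate(&fpu);	/* activation done by the restore step */

	return fpu.warn_count;
}
```

In this model, model_restore_sig(false) returns 0 and model_restore_sig(true) returns 1: the warning depends only on being scheduled between the two steps, not on how the state got there.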

