On Wednesday, November 28, 2018 at 4:51:08 PM UTC-5, Nadav Har'El wrote:
>
> On Wed, Nov 28, 2018 at 9:20 PM Waldek Kozaczuk <[email protected]
> <javascript:>> wrote:
>
>> The title of this thread indicates that the crash happens sporadically
>> and indeed it is the case with the scenario described in 1st and 2nd email
>> of this thread. The symptom (stack trace with fmt_fp() function captured by
>> gdb) looks identical to what was reported in
>> https://github.com/cloudius-systems/osv/issues/1010 and this
>> https://github.com/cloudius-systems/osv/issues/536. We cannot be certain
>> about whether the root cause is the same. However it is worth mentioning
>> that crash happens with same symptom almost every time if I run the
>> scenario described in issue #1010 which is different (but similar) than
>> scenario described in 1st and 2nd email of this thread. I am emphasizing
>> this because we have a way to replicate one of these crashes (*possibly
>> same root cause*) in very repeatable way in case we have ideas of things
>> to try.
>>
>
> That is good. If we give up on this effort (at least temporarily), please
> summarize what you know about reproducing this issue in a new bug-tracker
> issue (and reference 1010 and 536) so we could return to it later.
>
Created new issue - https://github.com/cloudius-systems/osv/issues/1018
>
>
>> Regarding your excellent explanations how non-FPU and FPU state is
>> saved/restored by OSv I wanted to run a scenario to see if I understand it
>> well (please forgive my ignorance if I am missing something obvious).
>> Imagine we have 2 user threads (uT1 and uT2) executing some *logic that
>> involves floating point operations in a loop* (like the one in the
>> disassembled code). Ffmpeg is a good example as it must be very CPU
>> intensive and doing all kind of FP computations to transcode video. Let us
>> say on given cpu this sequence happens in time:
>>
>> ->uT1 is running (executing long FP logic loop)
>> --> timer triggered interrupt happens because allotted time slice expired
>> [some FPU registers had some inflight data = FPU has state S1)
>> ----> OSv saves FPU state and other registers using fpu_lock construct
>>
>
> So far, right. This state is saved inside the interrupt stack, which while
> thread uT1 is running, is uT1->_arch->interrupt_stack.
> By the way, note that the regular (non-FPU) registers are also on this
> stack - the code in entry.S saves them on the stack and passes the pointer
> to it when it calls interrupt().
>
> -------> OSv identifies next thread to run which is uT2
>> ---------> OSv restores FPU to state S1 (fpu_lock goes out of scope) and
>> other registers as well and switches to uT2
>>
>
> No, this is not what happens now (at least, not if you mean that "S1" was
> uT1's state...). Before interrupt() ends or fpu_lock runs out of scope,
> interrupt calls sched::preempt().
> This is a weird function. It doesn't do something quickly and return :-)
> Rather, it calls the scheduler to decide which thread to run next. If it
> decides to run uT1 again, it just returns normally (and the FPU state is
> restored and the interrupt returns).
> But if it decides to run uT2 it doesn't quite return... Rather, it
> *switches* to thread uT2. This means we switch to uT2's stack, and jump to
> the program counter where uT2 was before. And where was uT2 running before?
> One of the likely options was that it was itself in interrupt() calling
> preempt() :-) So now, finally, when uT2 resumes, its call to preempt()
> "returns", and we reach the end of the interrupt() function. Remember that
> at this point we're running uT2, with uT2's stack. So when interrupt()
> ends, the fpu_lock goes out of scope, and the FPU state is popped from
> *uT2*'s stack. This is was uT2's FPU state, saved long ago, and now we're
> restoring it, and when interrupt() returns, uT2's correct FPU state will
> have been restored.
>
>
>> --------------> UT2 is running (executing long FP logic loop)
>> --> timer triggered interrupt happens again because allotted time slice
>> expired [some FPU registers had some inflight data = FPU has state S2)
>> ... same thing
>> ... same thing
>> ---------> OSv restores FPU to state S2 and other registers as well and
>> switches back to uT1
>> *--------------> UT1 is resuming where it left above and instead of FPU
>> be in expected S1 state it sees FPU in state S2 -> crash????*
>>
>> What am I missing in my scenario?
>>
>> Lastly I am adding the disassembled portion of fmt_fp() with applied
>> patch like below that makes the repeatable crash (from issue #1010) go away
>> or at least way less frequent.
>>
>> diff --git a/libc/stdio/vfprintf.c b/libc/stdio/vfprintf.c
>> index aac790c0..1e116038 100644
>> --- a/libc/stdio/vfprintf.c
>> +++ b/libc/stdio/vfprintf.c
>> @@ -8,6 +8,7 @@
>> #include <inttypes.h>
>> #include <math.h>
>> #include <float.h>
>> +#include <assert.h>
>> /* Some useful macros */
>> @@ -296,9 +297,14 @@ static int fmt_fp(FILE *f, long double y, int w, int
>> p, int fl, int t)
>> if (e2<0) a=r=z=big;
>> else a=r=z=big+sizeof(big)/sizeof(*big) - LDBL_MANT_DIG - 1;
>> + int steps = 0;
>> do {
>> *z = y;
>> y = 1000000000*(y-*z++);
>> + steps++;
>> + if(steps > 2000) {
>> + assert(0);
>> + }
>> } while (y);
>> while (e2>0) {
>>
>> Your explanation was that because of added assert() function call we
>> force compiler to generate code to save/restore any used FPU registers.
>>
>
> Yes, my thesis (and I'm too tired now to verify this in the assembler code
> below) was that because there is this function call (and the compiler
> doesn't know if it is commonly called or not), the compiler doesn't want to
> leave y in a register, so it will load and store it from memory. But this
> is just a guess.
> You can try instead of assert(0), do
>
> asm volatile ("cli; hlt" : : : "memory");
>
> This is not a function call, so it will not change the compiled code. It
> will allow you to connect with a debugger to print more stuff at that
> point. Note that this cli; hlt stuff will only hang the single CPU that ran
> this command (if you can reproduce this with -c1, this will be the only cpu
> so things will be easier).
>
>
>
>> But if we do not see the crash with this patch applied happeny any more,
>> then the assert function does not get called and FPU not saved/restored, so
>> how this helps make this crash go away?
>>
>> Portion of disassembled code:
>> if (e2<0) a=r=z=big;
>> else a=r=z=big+sizeof(big)/sizeof(*big) - LDBL_MANT_DIG - 1;
>>
>> int steps = 0;
>> do {
>> *z = y;
>> 501: d9 bd 0e e3 ff ff fnstcw -0x1cf2(%rbp)
>> if (e2<0) a=r=z=big;
>> 507: 48 8d 85 40 e3 ff ff lea -0x1cc0(%rbp),%rax
>> 50e: 45 85 c9 test %r9d,%r9d
>> 511: 48 8d 95 cc fe ff ff lea -0x134(%rbp),%rdx
>> 518: 48 89 c1 mov %rax,%rcx
>> *z = y;
>> 51b: 0f b7 85 0e e3 ff ff movzwl -0x1cf2(%rbp),%eax
>> if (e2<0) a=r=z=big;
>> 522: 48 0f 49 ca cmovns %rdx,%rcx
>> *z = y;
>> 526: 80 cc 0c or $0xc,%ah
>> if (e2<0) a=r=z=big;
>> 529: 48 89 8d d8 e2 ff ff mov %rcx,-0x1d28(%rbp)
>> y = 1000000000*(y-*z++);
>> 530: 48 8d 59 04 lea 0x4(%rcx),%rbx
>> 534: 48 8d 91 44 1f 00 00 lea 0x1f44(%rcx),%rdx
>> *z = y;
>> 53b: 66 89 85 0c e3 ff ff mov %ax,-0x1cf4(%rbp)
>> 542: d9 c0 fld %st(0)
>> 544: d9 ad 0c e3 ff ff fldcw -0x1cf4(%rbp)
>> 54a: df bd 00 e3 ff ff fistpll -0x1d00(%rbp)
>> 550: d9 ad 0e e3 ff ff fldcw -0x1cf2(%rbp)
>> 556: 48 8b 85 00 e3 ff ff mov -0x1d00(%rbp),%rax
>> 55d: 89 01 mov %eax,(%rcx)
>> y = 1000000000*(y-*z++);
>> 55f: 89 c0 mov %eax,%eax
>> 561: 48 89 85 f8 e2 ff ff mov %rax,-0x1d08(%rbp)
>> 568: df ad f8 e2 ff ff fildll -0x1d08(%rbp)
>> 56e: de e9 fsubrp %st,%st(1)
>> 570: d9 05 00 00 00 00 flds 0x0(%rip) # 576
>> <fmt_fp+0x176>
>> 576: dc c9 fmul %st,%st(1)
>> steps++;
>> if(steps > 2000) {
>> 578: eb 48 jmp 5c2 <fmt_fp+0x1c2>
>> 57a: d9 c9 fxch %st(1)
>> 57c: eb 04 jmp 582 <fmt_fp+0x182>
>> 57e: 66 90 xchg %ax,%ax
>> 580: d9 c9 fxch %st(1)
>> *z = y;
>> 582: d9 c0 fld %st(0)
>> 584: d9 ad 0c e3 ff ff fldcw -0x1cf4(%rbp)
>> 58a: df bd 00 e3 ff ff fistpll -0x1d00(%rbp)
>> 590: d9 ad 0e e3 ff ff fldcw -0x1cf2(%rbp)
>> y = 1000000000*(y-*z++);
>> 596: 48 83 c3 04 add $0x4,%rbx
>> *z = y;
>> 59a: 48 8b 85 00 e3 ff ff mov -0x1d00(%rbp),%rax
>> 5a1: 89 43 fc mov %eax,-0x4(%rbx)
>> y = 1000000000*(y-*z++);
>> 5a4: 89 c0 mov %eax,%eax
>> 5a6: 48 89 85 f8 e2 ff ff mov %rax,-0x1d08(%rbp)
>> 5ad: df ad f8 e2 ff ff fildll -0x1d08(%rbp)
>> 5b3: de e9 fsubrp %st,%st(1)
>> 5b5: d8 c9 fmul %st(1),%st
>> if(steps > 2000) {
>> 5b7: 48 39 d3 cmp %rdx,%rbx
>> 5ba: 0f 84 99 10 00 00 je 1659 <fmt_fp+0x1259>
>> 5c0: d9 c9 fxch %st(1)
>> assert(0);
>> }
>> } while (y);
>> 5c2: d9 ee fldz
>> 5c4: d9 ca fxch %st(2)
>> 5c6: db ea fucomi %st(2),%st
>> 5c8: dd da fstp %st(2)
>> 5ca: 7a ae jp 57a <fmt_fp+0x17a>
>> 5cc: 75 b2 jne 580 <fmt_fp+0x180>
>> 5ce: dd d8 fstp %st(0)
>> 5d0: dd d8 fstp %st(0)
>>
>> On Wednesday, November 28, 2018 at 8:14:05 AM UTC-5, Nadav Har'El wrote:
>>>
>>>
>>> On Wed, Nov 28, 2018 at 2:18 PM Waldek Kozaczuk <[email protected]>
>>> wrote:
>>>
>>>> On Nov 28, 2018, at 03:58, Nadav Har'El <[email protected]> wrote:
>>>>
>>>>
>>>> The situation is different with involuntary context switches. When an
>>>> asynchronous event, e.g., an interrupt, occurs, the user thread is in a
>>>> random position in the code. It may be using all its registers, and the
>>>> FPU
>>>> state (including the old-style FPU and the new SSE and AVX registers).
>>>> Because our interrupt handler (which may do anything from running the
>>>> scheduler to reading a page from disk on page fault) may need to use any
>>>> of
>>>> these registers, all of them, including the FPU, need to be saved on
>>>> interrupt time. The interrupt has a separate stack, and the FPU is saved
>>>> on
>>>> this stack (see fpu_lock use in interrupt()). When the interrupt finishes,
>>>> this FPU is restored. This includes involuntary context switching: thread
>>>> A
>>>> receives an interrupt, saves the FPU, does something and decides to switch
>>>> to thread B, and a while later we switch back to thread A at which point
>>>> the interrupt handler "returns" and restores the FPU state.
>>>>
>>>> Does involuntary case include scenario when enough time designated for
>>>> current thread by scheduler expires? I would imaging this would qualify as
>>>> interrupt?
>>>>
>>>
>>> Indeed. We set a timer to when this thread's runtime quota will expire,
>>> and this timer generates an interrupt. Check out interrupt() - after saving
>>> the FPU state and acknowledging the interrupt (EOI), it calls the scheduler
>>> (sched::preempt()). This will decide which thread to run next - it may run
>>> the same thread again, or a different thread. When sched::preempt() returns
>>> - possibly a long time later, it means the scheduler decided to run *this*
>>> thread again. At that point, interrupt() returns and just before returning
>>> it restores the FPU state automatically.
>>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "OSv Development" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected] <javascript:>.
>> For more options, visit https://groups.google.com/d/optout.
>>
>
--
You received this message because you are subscribed to the Google Groups "OSv
Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
For more options, visit https://groups.google.com/d/optout.