Fwd: OSv crashes fairly sporadically with page fault when transcoding video with ffmpeg

Waldek Kozaczuk Sun, 02 Dec 2018 14:05:49 -0800

Forgot to reply to entire group.

---------- Forwarded message ---------
From: Waldek Kozaczuk <[email protected]>
Date: Sun, Dec 2, 2018 at 16:26
Subject: Re: OSv crashes fairly sporadically with page fault when
transcoding video with ffmpeg
To: Nadav Har'El <[email protected]>



I think I have good news but I am adding my comments inlined.
On Sun, Dec 2, 2018 at 7:08 AM Nadav Har'El <[email protected]> wrote:

> On Sat, Dec 1, 2018 at 6:26 PM Waldek Kozaczuk <[email protected]>
> wrote:
>
>> Here we go again ... I was thinking more about it.
>>
>> First of regarding potential corruption of FPU state during involuntary
>> interrupt task switch, unless my code that verifies it is not the case
>> using the checksum logic (see my previous email with the patch), there
>> could be another possibility - we are telling xsave/xstor to save/restore
>> to/from wrong location in memory. Meaning that whatever got saved stays not
>> corrupted but somehow we are restoring wrong data for whatever reason.
>> Probably very unlikely.
>>
>
> Yes, I think it is unlikely - since you are checking the checksum, it
> means we are restoring valid saved state, but someone else's saved state. I
> don't have an idea how this can happen.
>
> If we rule out corruption of the saved FPU state, I can think of three
> remaining options:
>
> 1. This is not a saved-FPU-state corruption - printf() really goes into an
> endless loop when it gets a certain input. I'm still hoping you can get
> more detailed printouts on what exactly what printf()'s *input* when it
> went into this endless loop. One thing you can do is to have a variable
> (perhaps thread-local global if you don't want it optimized out?) holding
> that printf_fp's last input, and when you get a crash, print this global.
> Or something.
>
2. This is a situation of some asynchronous even where we ran a handler -
> especially the scheduler which uses the FPU - but we *forgot* to save the
> FPU state. I don't know what this can be. This can perhaps be debugged with
> tracepoints - enable various tracepoints, especially the scheduler, and fpu
> saves/restores, and after the crash look at the tracepoint log. Can we find
> unexpected scheduler calls without an fpu save?
>
> 3. This is a situation where we saved the FPU, but *after* doing some FPU
> work. We had and fix this kind of issue in
> 4349bfd3583df03d44cda480049e630081d3a20b but it's more surprising to see
> this kind of with x87 registers, not SSE, I can't think of a reason why the
> compiler will touch these registers without our intent.
>
>
>> Secondly I think that part of our theory, that this famous loop in
>> vfprintf loops infinitely until tries to access unmapped page of memory, is
>> very likely correct. But it might happening for different reason. So far we
>> thought that simply the number is NaN or corrupt that makes FPU Instruction
>> calculate NAN forever. But what if for whatever reason we are experiencing
>> FPU stack overflow - all registers are taken and any instruction trying to
>> load data into it keeps failing because it is full.
>>
>
> This is an interesting point. The eight x87 registers are freely
> addressable (st0 through st7), but can *also* be used as a stack. The
> current position is stored as three bits in the floating-point status word
> (FSW), and the fxsave (or newer xsave...) saves this as well on involuntary
> context switches.
>
Here is extra evidence that seems to validate at least some of the elements
of my theory.

When I add this to vfprintf:

> diff --git a/libc/stdio/vfprintf.c b/libc/stdio/vfprintf.c
>
> index aac790c0..5cd31b3e 100644
>
> --- a/libc/stdio/vfprintf.c
>
> +++ b/libc/stdio/vfprintf.c
>
> @@ -209,6 +209,9 @@ static char *fmt_u(uintmax_t x, char *s)
>
>
>
>  static int fmt_fp(FILE *f, long double y, int w, int p, int fl, int t)
>
>  {
>
> +       uint16_t fpu_status;
>
> +       asm volatile ("fnstsw %0" : "=m" (fpu_status));
>
> +
>
>         uint32_t big[(LDBL_MAX_EXP+LDBL_MANT_DIG)/9+1];
>
>         uint32_t *a, *d, *r, *z;
>
>         int e2=0, e, i, j, l;
>
> @@ -654,8 +657,16 @@ static int printf_core(FILE *f, const char *fmt,
>> va_list *ap, union arg *nl_arg,
>
>         return 1;
>
>  }
>
>
>
> +//The TLS version cannot be printed by GDB:
>
> +//(gdb) p vfprintf_fpu_status
>
> +// Cannot find thread-local storage for Thread 1, executable file
>> build/release.x64/loader.elf:
>
> +// TLS not supported or disabled on this target
>
> +//static __thread uint16_t vfprintf_fpu_status = 0;
>
> +static uint16_t vfprintf_fpu_status = 0;
>
>  int vfprintf(FILE *restrict f, const char *restrict fmt, va_list ap)
>
>  {
>
> +       asm volatile ("fnstsw %0" : "=m" (vfprintf_fpu_status));
>
> +
>
>         va_list ap2;
>
>         int nl_type[NL_ARGMAX+1] = {0};
>
>         union arg nl_arg[NL_ARGMAX+1];
>
>
>  and capture both variable (FPU status word) after crash:

> (gdb) bt
>
> #0  0x00000000003a8542 in processor::cli_hlt () at
>> arch/x64/processor.hh:247
>
> #1  arch::halt_no_interrupts () at arch/x64/arch.hh:48
>
> #2  osv::halt () at arch/x64/power.cc:24
>
> #3  0x000000000023ef34 in abort (fmt=fmt@entry=0x63199b "Aborted\n") at
>> runtime.cc:132
>
> #4  0x0000000000202765 in abort () at runtime.cc:98
>
> #5  0x0000000000346ce3 in mmu::vm_sigsegv (addr=<optimized out>,
>> ef=0xffff800004b8f068) at core/mmu.cc:1316
>
> #6  0x0000000000347947 in mmu::vm_fault (addr=addr@entry=35184374185984,
>> ef=ef@entry=0xffff800004b8f068) at core/mmu.cc:1330
>
> #7  0x00000000003a239c in page_fault (ef=0xffff800004b8f068) at
>> arch/x64/mmu.cc:38
>
> #8  <signal handler called>
>
> #9  0x000000000044ff77 in fmt_fp (f=0x2000001fd710, y=0, w=0, p=6, fl=0,
>> t=102) at libc/stdio/vfprintf.c:303
>
> #10 0x0000000000000000 in ?? ()
>
> (gdb) frame 9
>
> #9  0x000000000044ff77 in fmt_fp (f=0x2000001fd710, y=0, w=0, p=6, fl=0,
>> t=102) at libc/stdio/vfprintf.c:303
>
> 303 *z = y;
>
> (gdb) p/t fpu_status
>
> *$1 = 1100001*
>
> (gdb) p/t vfprintf_fpu_status
>
> *$2 = 11000001101001*
>
>
> According to this http://www.ray.masmcode.com/tutorial/fpuchap1.htm#sword
- FPU status word at the beginning of vfprint does not look anything clean.
So something left it not clean before.

>
> The calling convention when calling a function (i.e., voluntary context
> switch) is that this stack is popped. The same is true on return (except
> st0 can be used for returning a floating-point value, if this is a function
> returning a floating point value). I don't know what will happen if a
> function violates this convention.
>
>
>> I believe this exception is disabled but I can enable it to see if I am
>> correct and it happens somewhere in the code. Now what if after voluntary
>> or involuntary thread switch the FPU stack is full and the thread switched
>> to simply does not know it. Whose responsibility is in this case to make
>> sure that FPU is in clean state (empty)?
>>
>
> When we switch *back* to an old thread, we should switch to its old FPU
> stack, including its old position in the x87 stack, so if it's full, it's
> this thread's fault.
> But on a *new* thread, we do need to make sure it got a clean slate. We
> had a bug in this long ago, which 0e8d9b5e0a1af02a31799fe1d40e434bc2c2a5fb
> fixed - thread_main_c() now calls processor::init_fpu(). This calls the
> FNINIT instruction, which among other things, resets the stack postion (in
> FSW) to 0.
>
> But now that I think about this, in case of an involuntary switch to the
> kernel (e.g., interrupt) what FPU state does the kernel code get? What if
> the kernel gets a full fpu stack and then can't do its own FPU
> calculations? Should fpu_lock() run fpu_init() after saving the FPU state?
> In any case, I don't think this has anything to do with the problem at
> hand. I'll open a separate issue.
>
>
>> Maybe the involuntary switch has/should have two sub-cases:
>> 1) the thread we are switching to before was also involuntarily switched
>> and we need to restore its FPU state from interrupt stack which we do
>> 2) the thread we are switching to voluntarily yielded and ... I think FPU
>> should be in somewhat clean state, at least be empty unless something left
>> it full before, right?
>>
>
> I think you may have indeed found a problem (I'm not sure it's related to
> the bug we're chasing, but a problem nontheless, and *may* be related):
>
> We already have these two cases, implicitly: As I explained in a previous
> mail, when we context switch *back* to a thread, we get back to where it
> was when it gave up the CPU:
> 1. If this was an involuntary context switch, we find ourselves returning
> (very belatedly) from the scheduler to the end of the interrupt() function,
> and now this function ends and pops the saved fpu (from the thread's
> interrupt stack), and finally returns from interrupt() to the original
> thread.
> 2. If this was a voluntary context switch, the scheduler returns into
> whatever function which called it (sleep(), read(), etc.) and that returns
> to. The floating point state was never saved (it doesn't need to be,
> because a function was called) but, on return, it was never restored. It
> just gets a copy of whatever junk was there from the previous thread or OSv
> code that ran. The junk st0..st7 registers are fine - the caller cannot
> count on anything in there too. But the control registers might now have
> junk as well, and that OSv does have to restore, and that it does since
> commit 0959f20863b119c5ca7d1e663bc22e6329814386. However, we will never
> restore the floating point *status* word (FSW). So if one thread was
> interrupted with a full fp stack, another thread we switch to might be
> returned-to with a full stack.
>
> *You can try adding an FNINIT in arch-switch.hh *before*
> processor::fldcw(fpucw) and see if it fixes anything.*
>
> Indeed this what I did:

> diff --git a/arch/x64/arch-switch.hh b/arch/x64/arch-switch.hh
>
> index e3769540..7cd8d181 100644
>
> --- a/arch/x64/arch-switch.hh
>
> +++ b/arch/x64/arch-switch.hh
>
> @@ -103,6 +103,7 @@ void thread::switch_to()
>
>             [rip]"i"(offsetof(thread_state, rip))
>
>           : "rbx", "rdx", "rsi", "rdi", "r8", "r9",
>
>             "r10", "r11", "r12", "r13", "r14", "r15", "memory");
>
> *+    asm volatile("fninit");*
>
>      processor::fldcw(fpucw);
>
>      processor::ldmxcsr(mxcsr);
>
>  }
>
>
> *And at least the issue described in #1010 went away based on around 30-40
executions which used to fail almost every time. So I think is very good
news !!!  *

By the way, to be honest, all this discussion is about x87. I don't know
> what is the modern SSE equivalent :-(
>
> Finally, it may indeed make sense to enable fpu stack overflow exceptions
> (and their SSE equivalent??) to see if we have any of these problems.
>
>
>
>
>
>> My thinking might be all wrong. But just food for thought.
>>
>> Waldek
>>
>> Sent from my iPhone
>>
>> On Dec 1, 2018, at 00:46, Waldek Kozaczuk <[email protected]> wrote:
>>
>> My theory has fallen apart: I modified httpserver like this to pass
>> constant float value of 0.5 and still getting same crash:
>>
>> index 358a86c6..126ab5f4 100644
>> --- a/modules/httpserver-api/api/os.cc
>> +++ b/modules/httpserver-api/api/os.cc
>> @@ -131,7 +131,7 @@ void init(routes& routes)
>>              thread.migrations = t.stat_migrations.get();
>>              thread.preemptions = t.stat_preemptions.get();
>>              thread.name = t.name();
>> -            thread.priority = t.priority();
>> +            thread.priority = 1.0;//t.priority();
>>              thread.stack_size = t.get_stack_info().size;
>>              thread.status = t.get_status();
>>              threads.list.push(thread);
>>
>>  I even created separate test program that keeps sprintf'ing both NAN and
>> regular float numbers using 5 threads to replace ffmpeg and I could not
>> replicate same program running exact same scenario with httpserver.
>>
>> As I wrote in other thread I added checksum logic to detect FPU state
>> corruption that unfortunately does not detect anything. But is this logic
>> actually correct?
>>
>> Ediff --git a/arch/x64/arch-cpu.cc b/arch/x64/arch-cpu.cc
>> index bb96af3d..34bf6081 100644
>> --- a/arch/x64/arch-cpu.cc
>> +++ b/arch/x64/arch-cpu.cc
>> @@ -42,6 +42,14 @@ exception_guard::~exception_guard()
>>      sched::cpu::current()->arch.exit_exception();
>>  }
>>
>> +static inline u64 calculate_fpu_state_checksum(processor::fpu_state *s) {
>> +    u64 val = 0;
>> +    char* data = s->legacy;
>> +    for (u64 i = 0; i < sizeof(processor::fpu_state) - sizeof(u64); i++)
>> +       val += data[i];
>> +    return val;
>> +}
>> +
>>  extern "C"
>>  [[gnu::target("no-sse")]]
>>  void fpu_state_init_xsave(processor::fpu_state *s) {
>> @@ -61,21 +69,25 @@ void fpu_state_init_fxsave(processor::fpu_state *s) {
>>  extern "C"
>>  void fpu_state_save_xsave(processor::fpu_state *s) {
>>      processor::xsave(s, -1);
>> +    s->checksum = calculate_fpu_state_checksum(s);
>>  }
>>
>>  extern "C"
>>  void fpu_state_save_fxsave(processor::fpu_state *s) {
>>      processor::fxsave(s);
>> +    s->checksum = calculate_fpu_state_checksum(s);
>>  }
>>
>>  extern "C"
>>  void fpu_state_restore_xsave(processor::fpu_state *s) {
>>      processor::xrstor(s, -1);
>> +    assert(s->checksum == calculate_fpu_state_checksum(s));
>>  }
>>
>>  extern "C"
>>  void fpu_state_restore_fxsave(processor::fpu_state *s) {
>>      processor::fxrstor(s);
>> +    assert(s->checksum == calculate_fpu_state_checksum(s));
>>  }
>>
>>  extern "C"
>> diff --git a/arch/x64/processor.hh b/arch/x64/processor.hh
>> index 6b1fbf66..f699d5b8 100644
>> --- a/arch/x64/processor.hh
>> +++ b/arch/x64/processor.hh
>> @@ -343,6 +343,7 @@ struct fpu_state {
>>      char xsavehdr[24];
>>      char reserved[40];
>>      char ymm[256];
>> +    u64 checksum;
>>  } __attribute__((packed));
>>
>>  inline void fxsave(fpu_state* s)
>>
>> On Friday, November 30, 2018 at 7:43:31 PM UTC-5, Nadav Har'El wrote:
>>>
>>> On Thu, Nov 29, 2018 at 9:33 PM Waldek Kozaczuk <[email protected]>
>>> wrote:
>>>
>>>> I wonder if per this - https://software.intel.com/en-us/node/523375 -
>>>> we are enabling correct bits for xsave if caste is what being used here.
>>>>
>>>> “The second operand is a save/restore mask specifying the
>>>> saved/restored extended states. The value of the mask is ANDed with
>>>> XFEATURE_ENABLED_MASK(XCR0). A particular extended state is saved/restored
>>>> only if the corresponding bit of both save/restore mask and
>>>> XFEATURE_ENABLED_MASK is set to '1'.”
>>>>
>>>
>>> Interesting question. In arch/x64/arch-cpu.hh, arch_cpu::init_on_cpu(),
>>> you can see:
>>>
>>>     if (features().xsave) {
>>>         auto bits = xcr0_x87 | xcr0_sse;
>>>         if (features().avx) {
>>>             bits |= xcr0_avx;
>>>         }
>>>         write_xcr(xcr0, bits);
>>>     }
>>>
>>> So we're turning on the "x87", "sse", and "avx" bits. Are there any
>>> other bits we need to set (or at least, not zero with this
>>> overwrite-all-bits code)?
>>>
>>> --
>> You received this message because you are subscribed to a topic in the
>> Google Groups "OSv Development" group.
>> To unsubscribe from this topic, visit
>> https://groups.google.com/d/topic/osv-dev/ZAjvi_vEhnA/unsubscribe.
>> To unsubscribe from this group and all its topics, send an email to
>> [email protected].
>> For more options, visit https://groups.google.com/d/optout.
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "OSv Development" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> For more options, visit https://groups.google.com/d/optout.
>>
>

-- 
You received this message because you are subscribed to the Google Groups "OSv 
Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
For more options, visit https://groups.google.com/d/optout.

Fwd: OSv crashes fairly sporadically with page fault when transcoding video with ffmpeg

Reply via email to