Re: OSv crashes fairly sporadically with page fault when transcoding video with ffmpeg

Nadav Har'El Tue, 27 Nov 2018 00:36:23 -0800

On Tue, Nov 27, 2018 at 6:52 AM Waldek Kozaczuk <[email protected]>
wrote:


> Upgrading vfprintf.c to the latest version from musl did not help. So back
> to square one.
>

:-(


>
> (gdb) bt
> #0  0x00000000003a83d2 in processor::cli_hlt () at
> arch/x64/processor.hh:247
> #1  arch::halt_no_interrupts () at arch/x64/arch.hh:48
> #2  osv::halt () at arch/x64/power.cc:24
> #3  0x000000000023ef34 in abort (fmt=fmt@entry=0x63089b "Aborted\n") at
> runtime.cc:132
> #4  0x0000000000202765 in abort () at runtime.cc:98
> #5  0x0000000000346ce3 in mmu::vm_sigsegv (addr=<optimized out>,
> ef=0xffff800003431068) at core/mmu.cc:1316
> #6  0x0000000000347947 in mmu::vm_fault (addr=addr@entry=35184374185984,
> ef=ef@entry=0xffff800003431068) at core/mmu.cc:1330
> #7  0x00000000003a222c in page_fault (ef=0xffff800003431068) at
> arch/x64/mmu.cc:38
> #8  <signal handler called>
> #9  0x000000000044ef0f in fmt_fp (f=0x2000001fd710, y=0, w=0, p=6, fl=0,
> t=102) at libc/stdio/vfprintf.c:300
> #10 0x0000000000000000 in ?? ()
> (gdb) frame 9
> #9  0x000000000044ef0f in fmt_fp (f=0x2000001fd710, y=0, w=0, p=6, fl=0,
> t=102) at libc/stdio/vfprintf.c:300
> 300 *z = y;
> (gdb) display y
> 1: y = 0
>

If y is really 0 (and not some sort of "almost 0 printed as 0"), the loop
should have stopped. But it's also possible that gdb thinks it is 0
because that's what the variable's memory location says, while the compiler
never updated that location and only keeps y in a register... You
can disassemble the relevant part of this function to see what the compiled
code looks like.

That being said, something I don't understand: even if y is corrupted once,
I would expect it to perhaps double the amount this loop tries to write. It
seems the loop ran much more than that. Could y have been corrupted more
than once? Could y have reached some broken state which "looks" like zero
but never evaluates to be equal to zero?
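
To illustrate what "looks like zero but is not zero" could mean, here is a
minimal C sketch. The helper names (loop_would_continue, prints_as_zero,
demo) are made up for this example and are not part of vfprintf.c. It only
shows that a tiny non-zero long double, or a NaN, keeps a `while (y)` loop
going, while the tiny value still prints as 0.000000 with "%Lf". It makes no
claim about what gdb's p/f would print for such a value.

```c
#include <assert.h>
#include <float.h>
#include <math.h>
#include <stdio.h>
#include <string.h>

/* The loop condition `while (y)` in fmt_fp() boils down to this test. */
static int loop_would_continue(long double y)
{
    return y ? 1 : 0;
}

/* Returns 1 if y prints as plain zero with printf's default 6 digits. */
static int prints_as_zero(long double y)
{
    char buf[64];
    snprintf(buf, sizeof buf, "%Lf", y);
    return strcmp(buf, "0.000000") == 0;
}

/* Hypothetical demo: the smallest normalized long double prints as zero
 * but is still "true" as far as the loop condition is concerned. */
static void demo(void)
{
    assert(prints_as_zero(LDBL_MIN));
    assert(loop_would_continue(LDBL_MIN));
    assert(!loop_would_continue(0.0L));
}
```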



> (gdb) p/f y
> $1 = 0
> (gdb) p z
> $2 = (uint32_t *) 0x200000200000
>

Interesting. This is exactly the beginning of another huge page, so it
makes sense we would get a page fault at this place, after already looping
many more times than we should.
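
The arithmetic behind that observation can be checked with a few lines of C.
This is only a sketch: it assumes the huge pages in question are the 2 MiB
x86-64 ones, and the helper names are invented for the example. The two
addresses are the ones from the gdb output above (z, and the addr argument
of mmu::vm_fault in frame #6).

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 2 MiB huge pages (2^21 bytes). */
#define HUGE_PAGE_SIZE (UINT64_C(1) << 21)

/* Returns 1 if addr is the first byte of a 2 MiB huge page. */
static int is_huge_page_aligned(uint64_t addr)
{
    return (addr & (HUGE_PAGE_SIZE - 1)) == 0;
}

/* Hypothetical check of the values seen in the backtrace. */
static void check_backtrace_values(void)
{
    uint64_t z = UINT64_C(0x200000200000);           /* (gdb) p z */
    uint64_t fault_addr = UINT64_C(35184374185984);  /* vm_fault addr */

    assert(z == fault_addr);          /* the fault is exactly at z */
    assert(is_huge_page_aligned(z));  /* and z starts a new huge page */
}
```

So the faulting address is the same value as z, just printed in decimal, and
it has no low 21 bits set.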

> (gdb) p z - big // big is the beginning of the buffer, which is around
> 1800 long
> $3 = 4632
> (gdb) p big
> $4 = {0 <repeats 1828 times>}
>
>
>> The FPU bugs usually cause crashes when memory-copying code using FPU
>> overwrites random parts of memory. So why here do we always get a crash in
>> exactly the same place with this "z", and what overwrote it and when??? (z
>> is set just a couple of lines above and increased here in a tight loop,
>> what can overwrite it?)
>>
>> Another possibility is that "y" is kept in a floating point register and
>> clobbered by some interrupt that doesn't save FPU state (or something like
>> that) which causes the loop to continue forever.
>> If you can reliably reproduce this, you can add various printouts or
>> variables to help debug what goes wrong with "z" or "y".
>>
> What is interesting is that when I modified vfprintf.c like this, I could
> not reproduce the error any more (or maybe it is less frequent) after
> around 20 runs of ffmpeg. Normally it happens every time in the scenario
> #1010.
>
> diff --git a/libc/stdio/vfprintf.c b/libc/stdio/vfprintf.c
> index aac790c0..1e116038 100644
> --- a/libc/stdio/vfprintf.c
> +++ b/libc/stdio/vfprintf.c
> @@ -8,6 +8,7 @@
>  #include <inttypes.h>
>  #include <math.h>
>  #include <float.h>
> +#include <assert.h>
>
>  /* Some useful macros */
>
> @@ -296,9 +297,14 @@ static int fmt_fp(FILE *f, long double y, int w, int
> p, int fl, int t)
>         if (e2<0) a=r=z=big;
>         else a=r=z=big+sizeof(big)/sizeof(*big) - LDBL_MANT_DIG - 1;
>
> +        int steps = 0;
>         do {
>                 *z = y;
>                 y = 1000000000*(y-*z++);
> +                steps++;
> +               if(steps > 2000) {
> +                  assert(0);
>

Because you used a function call here, the compiler is forced to save its
FPU state... Can you try just "break" here and do the print or assert after
the loop?
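
Roughly what I have in mind, as a stand-alone sketch rather than a patch.
The function extract_digits() and the MAX_STEPS macro are hypothetical names
for this example; only the two statements at the top of the loop body are
taken from fmt_fp(). The point is that the loop body contains no function
call, and the reporting is left to code that runs after the loop.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_STEPS 2000  /* same cap as in the patch quoted above */

/* Runs the digit-extraction loop on y, writing into z.
 * The caller must provide room for at least MAX_STEPS + 1 entries,
 * and y must be non-negative and below 2^32 so the conversion to
 * uint32_t is defined.
 * Returns the number of iterations, or -1 if the cap was exceeded. */
static int extract_digits(long double y, uint32_t *z)
{
    int steps = 0;
    int runaway = 0;

    do {
        *z = y;
        y = 1000000000 * (y - *z++);
        if (++steps > MAX_STEPS) {
            runaway = 1;
            break;      /* no function call inside the loop */
        }
    } while (y);

    /* Printing or asserting belongs out here, after the loop. */
    return runaway ? -1 : steps;
}

/* Hypothetical caller: 0.5 needs two iterations (0, then 500000000). */
static void demo(void)
{
    uint32_t buf[MAX_STEPS + 1] = {0};
    int steps = extract_digits(0.5L, buf);
    assert(steps != -1);
}
```

If the problem still reproduces with this shape of loop, that would point
at y being clobbered while it lives in a register.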

> +                }
>         } while (y);
>
>         while (e2>0) {
>
> Could it be because gcc optimized the code differently and did not put "y"
> in an FPU register?
>
> BTW I wonder if this part of some optimization suggestions in #536 is at
> all relevant -
> https://github.com/cloudius-systems/osv/issues/536#issuecomment-62365543.
> Would it be possible to explicitly force gcc not to use an FPU register for
> y? Would it degrade performance a lot? In either case, even if possible, I
> am guessing the underlying issue of not saving FPU state somewhere could
> hurt us later in some other place in the code anyway, correct?
>

I'm less worried about fixing this specific bug, and much more worried that
we have a bigger bug which can crop up randomly in other workloads...


>
> Finally, when you say FPU state got corrupted or not saved/restored, do you
> mean the same thing? Can you elaborate?
> I see this struct in processor.hh:
> struct fpu_state {
>     char legacy[512];
>     char xsavehdr[24];
>     char reserved[40];
>     char ymm[256];
> } __attribute__((packed));
>
> Is this struct getting corrupted/overwritten in memory by some code, or is
> it that some interrupt handler does not save/restore the FPU using the
> fpu_lock class like it is done by the interrupt, page_fault, signal and
> syscall handlers? Is this what you are saying is difficult to trace?
>

The basic issue is that our kernel code uses the FPU - both for traditional
FPU work (e.g., floating-point calculations in the scheduler) and for
modern SIMD trickery generated by the compiler or hand-written by us (e.g.,
using some 128-bit FPU instruction to copy 128 bits at a time).  As I
mentioned above, when compiled code calls a function, the compiler saves
whatever FPU state it was in the middle of using. But when something happens
asynchronously - e.g., an interrupt or an exception - the code called
(the interrupt handler, etc.) needs to save the entire FPU state because it
has no idea what the interrupted code might be in the middle of doing.
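
To make the save/restore requirement concrete, here is a small user-space
sketch in C. It is not OSv code: it assumes x86-64 with GCC or Clang inline
assembly, uses the plain FXSAVE/FXRSTOR pair and a 512-byte area rather than
whatever fpu_lock does, and every name in it is invented for the example.
The "handler" stands in for an interrupt handler that touches FPU state; it
flips the rounding-control bits in MXCSR.

```c
#include <assert.h>
#include <stdint.h>

/* 512-byte FXSAVE area; the instruction requires 16-byte alignment. */
struct fxsave_area {
    unsigned char bytes[512];
} __attribute__((aligned(16)));

static void fpu_save(struct fxsave_area *a)
{
    __asm__ volatile("fxsave %0" : "=m"(*a) : : "memory");
}

static void fpu_restore(const struct fxsave_area *a)
{
    __asm__ volatile("fxrstor %0" : : "m"(*a) : "memory");
}

static uint32_t read_mxcsr(void)
{
    uint32_t v;
    __asm__ volatile("stmxcsr %0" : "=m"(v));
    return v;
}

static void write_mxcsr(uint32_t v)
{
    __asm__ volatile("ldmxcsr %0" : : "m"(v));
}

/* Stand-in for a handler that uses the FPU: it flips the SSE
 * rounding-control bits (MXCSR bits 13 and 14). */
static void fake_handler(void)
{
    write_mxcsr(read_mxcsr() ^ 0x6000u);
}

/* Runs the handler with or without save/restore around it.
 * Returns 1 if the "interrupted" code's MXCSR survived, 0 otherwise. */
static int state_survives(int save_state)
{
    struct fxsave_area area;
    uint32_t before = read_mxcsr();
    uint32_t after;

    if (save_state)
        fpu_save(&area);
    fake_handler();
    if (save_state)
        fpu_restore(&area);

    after = read_mxcsr();
    write_mxcsr(before);  /* leave the caller's state as we found it */
    return after == before;
}

/* Hypothetical demo of both cases. */
static void demo(void)
{
    assert(state_survives(1));   /* saved and restored: state intact */
    assert(!state_survives(0));  /* not saved: state is clobbered */
}
```

With the save/restore in place the interrupted code sees the same MXCSR
afterwards; without it, the handler's change leaks through. The same applies
to the data registers, which is the kind of thing that could clobber "y".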

We obviously have been doing this for a long time, but in some cases we
forgot to save the FPU state or did it at the wrong time, and this gets
discovered and exposed by new compiler optimizations which use the FPU in
more and more unexpected places (e.g., see
4349bfd3583df03d44cda480049e630081d3a20b for one I fixed last year).
Unfortunately it is possible we still have such bugs and they are hard to
hunt down (the last bug I solved by reviewing disassembled code....).

You're right that it's also possible that we did save the FPU state
correctly, but then the interrupt handler or another thread corrupted the
saved buffer, so when restoring it, we restore crap. I've never seen this
happening, but it definitely can...
