Re: OSv crashes fairly sporadically with page fault when transcoding video with ffmpeg

Waldek Kozaczuk Tue, 27 Nov 2018 12:02:03 -0800

Also I wonder if the musl sources are compiled correctly as far as floating 
point handling is concerned - 
https://wiki.musl-libc.org/mathematical-library.html.


Sent from my iPhone

> On Nov 27, 2018, at 13:36, Waldek Kozaczuk <[email protected]> wrote:
> 
> Here is the disassembled portion of the code in vfprint.c:
> 480         else a=r=z=big+sizeof(big)/sizeof(*big) - LDBL_MANT_DIG - 1;
>  481 
>  482         do {
>  483                 *z = y;
>  484      540:       d9 c0                   fld    %st(0)
>  485      542:       d9 ad 0c e3 ff ff       fldcw  -0x1cf4(%rbp)
>  486      548:       df bd 00 e3 ff ff       fistpll -0x1d00(%rbp)
>  487      54e:       d9 ad 0e e3 ff ff       fldcw  -0x1cf2(%rbp)
>  488                 y = 1000000000*(y-*z++);
>  489      554:       48 83 c3 04             add    $0x4,%rbx
>  490                 *z = y;
>  491      558:       48 8b 85 00 e3 ff ff    mov    -0x1d00(%rbp),%rax
>  492      55f:       89 43 fc                mov    %eax,-0x4(%rbx)
>  493                 y = 1000000000*(y-*z++);
>  494      562:       89 c0                   mov    %eax,%eax
>  495      564:       48 89 85 f0 e2 ff ff    mov    %rax,-0x1d10(%rbp)
>  496      56b:       df ad f0 e2 ff ff       fildll -0x1d10(%rbp)
>  497      571:       de e9                   fsubrp %st,%st(1)
>  498      573:       d8 c9                   fmul   %st(1),%st
>  499         } while (y);
>  500      575:       d9 ee                   fldz
>  501      577:       d9 c9                   fxch   %st(1)
>  502      579:       db e9                   fucomi %st(1),%st
>  503      57b:       dd d9                   fstp   %st(1)
>  504      57d:       7a c1                   jp     540 <fmt_fp+0x140>
>  505      57f:       75 bf                   jne    540 <fmt_fp+0x140>
>  506      581:       dd d8                   fstp   %st(0)
>  507      583:       dd d8                   fstp   %st(0)
>  508 
>  509         while (e2>0) {
>  510      585:       4c 8b a5 e0 e2 ff ff    mov    -0x1d20(%rbp),%r12
> 
> Obviously it does use FP registers. I wonder if there is some kind of 
> arithmetic exception encountered (overflow when multiplying) which we do not 
> handle correctly. I think we have handler that would simply abort in this 
> case.
> 
> Waldek
> 
>> On Tuesday, November 27, 2018 at 3:36:14 AM UTC-5, Nadav Har'El wrote:
>> 
>>> On Tue, Nov 27, 2018 at 6:52 AM Waldek Kozaczuk <[email protected]> wrote:
>>> Upgrading vfprintf.c to the latest version from musl did not help. So back 
>>> to square one.
>> 
>> :-(
>> 
>>> 
>>> 
>>> (gdb) bt
>>> #0  0x00000000003a83d2 in processor::cli_hlt () at arch/x64/processor.hh:247
>>> #1  arch::halt_no_interrupts () at arch/x64/arch.hh:48
>>> #2  osv::halt () at arch/x64/power.cc:24
>>> #3  0x000000000023ef34 in abort (fmt=fmt@entry=0x63089b "Aborted\n") at 
>>> runtime.cc:132
>>> #4  0x0000000000202765 in abort () at runtime.cc:98
>>> #5  0x0000000000346ce3 in mmu::vm_sigsegv (addr=<optimized out>, 
>>> ef=0xffff800003431068) at core/mmu.cc:1316
>>> #6  0x0000000000347947 in mmu::vm_fault (addr=addr@entry=35184374185984, 
>>> ef=ef@entry=0xffff800003431068) at core/mmu.cc:1330
>>> #7  0x00000000003a222c in page_fault (ef=0xffff800003431068) at 
>>> arch/x64/mmu.cc:38
>>> #8  <signal handler called>
>>> #9  0x000000000044ef0f in fmt_fp (f=0x2000001fd710, y=0, w=0, p=6, fl=0, 
>>> t=102) at libc/stdio/vfprintf.c:300
>>> #10 0x0000000000000000 in ?? ()
>>> (gdb) frame 9
>>> #9  0x000000000044ef0f in fmt_fp (f=0x2000001fd710, y=0, w=0, p=6, fl=0, 
>>> t=102) at libc/stdio/vfprintf.c:300
>>> 300                 *z = y;
>>> (gdb) display y
>>> 1: y = 0
>> 
>> If y is really 0 (and not some sort of "almost 0 printed as 0", the loop 
>> should have stopped. But maybe it's also possible that gdb thinks it is 0 
>> because that's what the memory variable says, but the compiler actually 
>> didn't update the memory variable and only keeps y in a register.... You can 
>> disassemble the relevant part of this function to see what the compiled 
>> looks like.
>> 
>> That being said, something I don't understand: even if y is corrupted once, 
>> I would expect it to perhaps double the amount this loop tries to write. It 
>> seems the loop ran much more than that. Could y have been corrupted more 
>> than once? Could y have reached some broken state which "looks" like zero 
>> but never evaluates to be equal to zero?
>> 
>>  
>>> (gdb) p/f y
>>> $1 = 0
>>> (gdb) p z
>>> $2 = (uint32_t *) 0x200000200000
>> 
>> Interesting. This is exactly the beginning of another huge page, so it makes 
>> sense we would get a page fault at this place, after already looping many 
>> more times than we should.
>> 
>>> (gdb) p z - big // big is a beginning of the buffer that is around 1800 
>>> long 
>>> $3 = 4632
>>> (gdb) p big
>>> $4 = {0 <repeats 1828 times>}
>>> 
>>> 
>>>> The FPU bugs usually cause crashes when memory-copying code using FPU 
>>>> overwrites random parts of memory. So why here do we always get a crash in 
>>>> exactly the same place with this "z", and what overwrote it and when??? (z 
>>>> is set just a couple of lines above and increased here in a tight loop, 
>>>> what can overwrite it?
>>>> 
>>>> Another possibility is that "y" is kept in a floating point register and 
>>>> clobbered by some interrupt that doesn't save FPU state (or something like 
>>>> that) which causes the loop to continue forever. 
>>>> If you can reliably reproduce this, you can add various printouts or 
>>>> variables to help debug what goes wrong with "z" or "y".
>>>> 
>>> What is interesting when I modified vfprintf.c like the I could not 
>>> reproduce the error any more (or maybe less frequent) after around 20 runs 
>>> of ffmpeg. Normally it happens every time in the scenario #1010.
>>> 
>>> diff --git a/libc/stdio/vfprintf.c b/libc/stdio/vfprintf.c
>>> index aac790c0..1e116038 100644
>>> --- a/libc/stdio/vfprintf.c
>>> +++ b/libc/stdio/vfprintf.c
>>> @@ -8,6 +8,7 @@
>>>  #include <inttypes.h>
>>>  #include <math.h>
>>>  #include <float.h>
>>> +#include <assert.h>
>>>  
>>>  /* Some useful macros */
>>>  
>>> @@ -296,9 +297,14 @@ static int fmt_fp(FILE *f, long double y, int w, int 
>>> p, int fl, int t)
>>>         if (e2<0) a=r=z=big;
>>>         else a=r=z=big+sizeof(big)/sizeof(*big) - LDBL_MANT_DIG - 1;
>>>  
>>> +        int steps = 0;
>>>         do {
>>>                 *z = y;
>>>                 y = 1000000000*(y-*z++);
>>> +                steps++;
>>> +               if(steps > 2000) {
>>> +                  assert(0);
>> 
>> Because you used a function call here, the compiler is forced to save its 
>> FPU state... Can you try just "break" here and do the print or assert after 
>> the loop?
>> 
>>> +                }
>>>         } while (y);
>>>  
>>>         while (e2>0) {
>>> 
>>> Could it be because gcc optimized code differently and did not put "y" in 
>>> the FPU registerr?
>>> 
>>> BTW I wonder if this part of some optimization suggestions in #536 is any 
>>> relevant - 
>>> https://github.com/cloudius-systems/osv/issues/536#issuecomment-62365543. 
>>> Would it be possible to explicitly force gcc not to use FPU register for y? 
>>> Would it degrade performance a lot? In ether case even if possible I am 
>>> guessing the underlying issue of not saving FPU state somewhere could hurt 
>>> us later in other place of the code anyway, correct? 
>> 
>> I'm less worried about fixing this specific bug, and much more worried that 
>> we have a bigger bug which can crop up randomly in other workloads...
>>  
>>>  
>>> Finally when you say FPU state got corrupted or not saved/restored do you 
>>> mean same thing? Can you elaborate?
>>> I see this struct in processor.hh:
>>> struct fpu_state {
>>>     char legacy[512];
>>>     char xsavehdr[24];
>>>     char reserved[40];
>>>     char ymm[256];
>>> } __attribute__((packed));
>>> 
>>> Is this struct getting corrupted/overwritten in memory by some code or is 
>>> it that some interrupt handler does not save/restore FPU using fpu_lock 
>>> class like it is done by interrupt, page_fault handler, signal and syscall 
>>> handler? Is this what you are saying difficult to trace?
>> 
>> The basic issue is that our kernel code uses the FPU - both for traditional 
>> FPU work (e.g., floating-point calculations in the scheduler, etc.) and for 
>> modern SIMD trickery generated by the compiler or hand-written by us (e.g., 
>> use some 128-bit FPU instruction to copy 128 bits at a time).  As I 
>> mentioned above, when the compiler calls a function, it saves whatever FPU 
>> work it was in the middle of. But when something happens asynchronously - 
>> e.g, an interrupt, exception, etc., the code called (interrupt handler, 
>> etc.) needs to save the entire FPU state because it has no idea what the 
>> user's code might be in the middle of doing.
>> 
>> We obviously have been doing this for a long time, but in some cases we 
>> forgot to save the FPU state or did it in the wrong time and this ets 
>> discovered and exposed by new compiler optimization which uses FPU in more 
>> and more unexpected places (e.g., see 
>> 4349bfd3583df03d44cda480049e630081d3a20b for one I fixed last year). 
>> Unfortunately it is possible we still have such bugs and they are hard to 
>> hunt down (the last bug I solved by reviewing disassembled code....).
>> 
>> You're right that it's also possible that we did save the FPU state 
>> correctly, but then the interrupt handler or another thread corrupted the 
>> saved buffer, so when restoring it, we restore crap. I've never seen this 
>> happening, but it definitely can...
> 
> -- 
> You received this message because you are subscribed to a topic in the Google 
> Groups "OSv Development" group.
> To unsubscribe from this topic, visit 
> https://groups.google.com/d/topic/osv-dev/ZAjvi_vEhnA/unsubscribe.
> To unsubscribe from this group and all its topics, send an email to 
> [email protected].
> For more options, visit https://groups.google.com/d/optout.

-- 
You received this message because you are subscribed to the Google Groups "OSv 
Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
For more options, visit https://groups.google.com/d/optout.

Re: OSv crashes fairly sporadically with page fault when transcoding video with ffmpeg

Reply via email to