Re: OSv crashes fairly sporadically with page fault when transcoding video with ffmpeg

Waldek Kozaczuk Tue, 27 Nov 2018 10:36:44 -0800

Here is the disassembled portion of the code in vfprint.c:
480         else a=r=z=big+sizeof(big)/sizeof(*big) - LDBL_MANT_DIG - 1;
 481 
 482         do {
 483                 *z = y;
 484      540:       d9 c0                   fld    %st(0)
 485      542:       d9 ad 0c e3 ff ff       fldcw  -0x1cf4(%rbp)
 486      548:       df bd 00 e3 ff ff       fistpll -0x1d00(%rbp)
 487      54e:       d9 ad 0e e3 ff ff       fldcw  -0x1cf2(%rbp)
 488                 y = 1000000000*(y-*z++);
 489      554:       48 83 c3 04             add    $0x4,%rbx
 490                 *z = y;
 491      558:       48 8b 85 00 e3 ff ff    mov    -0x1d00(%rbp),%rax
 492      55f:       89 43 fc                mov    %eax,-0x4(%rbx)
 493                 y = 1000000000*(y-*z++);
 494      562:       89 c0                   mov    %eax,%eax
 495      564:       48 89 85 f0 e2 ff ff    mov    %rax,-0x1d10(%rbp)
 496      56b:       df ad f0 e2 ff ff       fildll -0x1d10(%rbp)
 497      571:       de e9                   fsubrp %st,%st(1)
 498      573:       d8 c9                   fmul   %st(1),%st
 499         } while (y);
 500      575:       d9 ee                   fldz
 501      577:       d9 c9                   fxch   %st(1)
 502      579:       db e9                   fucomi %st(1),%st
 503      57b:       dd d9                   fstp   %st(1)
 504      57d:       7a c1                   jp     540 <fmt_fp+0x140>
 505      57f:       75 bf                   jne    540 <fmt_fp+0x140>
 506      581:       dd d8                   fstp   %st(0)
 507      583:       dd d8                   fstp   %st(0)
 508 
 509         while (e2>0) {
 510      585:       4c 8b a5 e0 e2 ff ff    mov    -0x1d20(%rbp),%r12


Obviously it does use FP registers. I wonder if there is some kind of 
arithmetic exception encountered (overflow when multiplying) which we do 
not handle correctly. I think we have handler that would simply abort in 
this case.

Waldek

On Tuesday, November 27, 2018 at 3:36:14 AM UTC-5, Nadav Har'El wrote:
>
>
> On Tue, Nov 27, 2018 at 6:52 AM Waldek Kozaczuk <[email protected] 
> <javascript:>> wrote:
>
>> Upgrading vfprintf.c to the latest version from musl did not help. So 
>> back to square one.
>>
>
> :-(
>
>
>>
>> (gdb) bt
>> #0  0x00000000003a83d2 in processor::cli_hlt () at 
>> arch/x64/processor.hh:247
>> #1  arch::halt_no_interrupts () at arch/x64/arch.hh:48
>> #2  osv::halt () at arch/x64/power.cc:24
>> #3  0x000000000023ef34 in abort (fmt=fmt@entry=0x63089b "Aborted\n") at 
>> runtime.cc:132
>> #4  0x0000000000202765 in abort () at runtime.cc:98
>> #5  0x0000000000346ce3 in mmu::vm_sigsegv (addr=<optimized out>, 
>> ef=0xffff800003431068) at core/mmu.cc:1316
>> #6  0x0000000000347947 in mmu::vm_fault (addr=addr@entry=35184374185984, 
>> ef=ef@entry=0xffff800003431068) at core/mmu.cc:1330
>> #7  0x00000000003a222c in page_fault (ef=0xffff800003431068) at 
>> arch/x64/mmu.cc:38
>> #8  <signal handler called>
>> #9  0x000000000044ef0f in fmt_fp (f=0x2000001fd710, y=0, w=0, p=6, fl=0, 
>> t=102) at libc/stdio/vfprintf.c:300
>> #10 0x0000000000000000 in ?? ()
>> (gdb) frame 9
>> #9  0x000000000044ef0f in fmt_fp (f=0x2000001fd710, y=0, w=0, p=6, fl=0, 
>> t=102) at libc/stdio/vfprintf.c:300
>> 300 *z = y;
>> (gdb) display y
>> 1: y = 0
>>
>
> If y is really 0 (and not some sort of "almost 0 printed as 0", the loop 
> should have stopped. But maybe it's also possible that gdb thinks it is 0 
> because that's what the memory variable says, but the compiler actually 
> didn't update the memory variable and only keeps y in a register.... You 
> can disassemble the relevant part of this function to see what the compiled 
> looks like.
>
> That being said, something I don't understand: even if y is corrupted 
> once, I would expect it to perhaps double the amount this loop tries to 
> write. It seems the loop ran much more than that. Could y have been 
> corrupted more than once? Could y have reached some broken state which 
> "looks" like zero but never evaluates to be equal to zero?
>
>  
>
>> (gdb) p/f y
>> $1 = 0
>> (gdb) p z
>> $2 = (uint32_t *) 0x200000200000
>>
>
> Interesting. This is exactly the beginning of another huge page, so it 
> makes sense we would get a page fault at this place, after already looping 
> many more times than we should.
>
> (gdb) p z - big // big is a beginning of the buffer that is around 1800 
>> long 
>> $3 = 4632
>> (gdb) p big
>> $4 = {0 <repeats 1828 times>}
>>
>>
>> The FPU bugs usually cause crashes when memory-copying code using FPU 
>>> overwrites random parts of memory. So why here do we always get a crash in 
>>> exactly the same place with this "z", and what overwrote it and when??? (z 
>>> is set just a couple of lines above and increased here in a tight loop, 
>>> what can overwrite it?
>>>
>>> Another possibility is that "y" is kept in a floating point register and 
>>> clobbered by some interrupt that doesn't save FPU state (or something like 
>>> that) which causes the loop to continue forever. 
>>> If you can reliably reproduce this, you can add various printouts or 
>>> variables to help debug what goes wrong with "z" or "y".
>>>
>>> What is interesting when I modified vfprintf.c like the I could not 
>> reproduce the error any more (or maybe less frequent) after around 20 runs 
>> of ffmpeg. Normally it happens every time in the scenario #1010.
>>
>> diff --git a/libc/stdio/vfprintf.c b/libc/stdio/vfprintf.c
>> index aac790c0..1e116038 100644
>> --- a/libc/stdio/vfprintf.c
>> +++ b/libc/stdio/vfprintf.c
>> @@ -8,6 +8,7 @@
>>  #include <inttypes.h>
>>  #include <math.h>
>>  #include <float.h>
>> +#include <assert.h>
>>  
>>  /* Some useful macros */
>>  
>> @@ -296,9 +297,14 @@ static int fmt_fp(FILE *f, long double y, int w, int 
>> p, int fl, int t)
>>         if (e2<0) a=r=z=big;
>>         else a=r=z=big+sizeof(big)/sizeof(*big) - LDBL_MANT_DIG - 1;
>>  
>> +        int steps = 0;
>>         do {
>>                 *z = y;
>>                 y = 1000000000*(y-*z++);
>> +                steps++;
>> +               if(steps > 2000) {
>> +                  assert(0);
>>
>
> Because you used a function call here, the compiler is forced to save its 
> FPU state... Can you try just "break" here and do the print or assert after 
> the loop?
>
> +                }
>>         } while (y);
>>  
>>         while (e2>0) {
>>
>> Could it be because gcc optimized code differently and did not put "y" in 
>> the FPU registerr?
>>
>> BTW I wonder if this part of some optimization suggestions in #536 is any 
>> relevant - 
>> https://github.com/cloudius-systems/osv/issues/536#issuecomment-62365543. 
>> Would it be possible to explicitly force gcc not to use FPU register for y? 
>> Would it degrade performance a lot? In ether case even if possible I am 
>> guessing the underlying issue of not saving FPU state somewhere could hurt 
>> us later in other place of the code anyway, correct? 
>>
>
> I'm less worried about fixing this specific bug, and much more worried 
> that we have a bigger bug which can crop up randomly in other workloads...
>  
>
>>  
>> Finally when you say FPU state got corrupted or not saved/restored do you 
>> mean same thing? Can you elaborate?
>> I see this struct in processor.hh:
>> struct fpu_state {
>>     char legacy[512];
>>     char xsavehdr[24];
>>     char reserved[40];
>>     char ymm[256];
>> } __attribute__((packed));
>>
>> Is this struct getting corrupted/overwritten in memory by some code or is 
>> it that some interrupt handler does not save/restore FPU using fpu_lock 
>> class like it is done by interrupt, page_fault handler, signal and syscall 
>> handler? Is this what you are saying difficult to trace?
>>
>
> The basic issue is that our kernel code uses the FPU - both for 
> traditional FPU work (e.g., floating-point calculations in the scheduler, 
> etc.) and for modern SIMD trickery generated by the compiler or 
> hand-written by us (e.g., use some 128-bit FPU instruction to copy 128 bits 
> at a time).  As I mentioned above, when the compiler calls a function, it 
> saves whatever FPU work it was in the middle of. But when something happens 
> asynchronously - e.g, an interrupt, exception, etc., the code called 
> (interrupt handler, etc.) needs to save the entire FPU state because it has 
> no idea what the user's code might be in the middle of doing.
>
> We obviously have been doing this for a long time, but in some cases we 
> forgot to save the FPU state or did it in the wrong time and this ets 
> discovered and exposed by new compiler optimization which uses FPU in more 
> and more unexpected places (e.g., see 
> 4349bfd3583df03d44cda480049e630081d3a20b for one I fixed last year). 
> Unfortunately it is possible we still have such bugs and they are hard to 
> hunt down (the last bug I solved by reviewing disassembled code....).
>
> You're right that it's also possible that we did save the FPU state 
> correctly, but then the interrupt handler or another thread corrupted the 
> saved buffer, so when restoring it, we restore crap. I've never seen this 
> happening, but it definitely can...
>

-- 
You received this message because you are subscribed to the Google Groups "OSv 
Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
For more options, visit https://groups.google.com/d/optout.

Re: OSv crashes fairly sporadically with page fault when transcoding video with ffmpeg

Reply via email to