Re: OSv crashes fairly sporadically with page fault when transcoding video with ffmpeg

Waldek Kozaczuk Mon, 26 Nov 2018 20:52:41 -0800

Upgrading vfprintf.c to the latest version from musl did not help. So back 
to square one.


On Monday, November 26, 2018 at 9:27:43 AM UTC-5, Nadav Har'El wrote:
>
> On Thu, Nov 22, 2018 at 9:55 PM Waldek Kozaczuk <[email protected] 
> <javascript:>> wrote:
>
>> I see this crash:
>>
>> frame=  275 fps= 47 q=-0.0 size=     520kB time=00:00:1pa2ge. f2a1 ublitt 
>> roauttes=i d3e4 8a.p6pklbiictast/iso ns,p eeadd=d2r200001d00000 0 0
>> [registers]
>> RIP: 0x000000000044ef5f <???+4517727>
>> RFL: 0x0000000000010202  CS:  0x0000000000000008  SS:  0x0000000000000010
>> RAX: 0x8000000000000000  RBX: 0x0000200001d00004  RCX: 
>> 0x0000000000000002  RDX: 0x0000200001cff66c
>> RSI: 0xfffffffffffffffc  RDI: 0x8000000000000000  RBP: 
>> 0x0000200001cff7a0  R8:  0x0000000000004000
>> R9:  0x00000000ffffffe5  R10: 0x0000000000004000  R11: 
>> 0x8000000000000000  R12: 0x0000000000000000
>> R13: 0x00000000ffffffe5  R14: 0x0000000000004000  R15: 
>> 0x8000000000000000  RSP: 0x0000200001cfda50
>> Aborted
>>
>> [backtrace]
>> 0x0000000000346ce2 <???+3435746>
>> 0x0000000000347946 <mmu::vm_fault(unsigned long, exception_frame*)+310>
>> 0x00000000003a222b <page_fault+123>
>> 0x00000000003a10a6 <???+3805350>
>>
>> Please note that ffmpeg is constantly printing to screen (vga or serial 
>> console?) some output about progress.
>>
>> Once connected to gdb I see this stacktrace:
>>
>> (gdb) bt
>> #0  0x00000000003a83d2 in processor::cli_hlt () at 
>> arch/x64/processor.hh:247
>> #1  arch::halt_no_interrupts () at arch/x64/arch.hh:48
>> #2  osv::halt () at arch/x64/power.cc:24
>> #3  0x000000000023ef34 in abort (fmt=fmt@entry=0x63095b "Aborted\n") at 
>> runtime.cc:132
>> #4  0x0000000000202765 in abort () at runtime.cc:98
>> #5  0x0000000000346ce3 in mmu::vm_sigsegv (addr=<optimized out>, 
>> ef=0xffff800006550068) at core/mmu.cc:1316
>> #6  0x0000000000347947 in mmu::vm_fault (addr=addr@entry=35184402497536, 
>> ef=ef@entry=0xffff800006550068) at core/mmu.cc:1330
>> #7  0x00000000003a222c in page_fault (ef=0xffff800006550068) at 
>> arch/x64/mmu.cc:38
>> #8  <signal handler called>
>> #9  0x000000000044ef5f in fmt_fp (f=0x200001cffa50, y=0, w=0, p=2, fl=0, 
>> t=102) at libc/stdio/vfprintf.c:300
>> #10 0x0000000000000000 in ?? ()
>>
>> I wonder if this is related to 
>> https://github.com/cloudius-systems/osv/issues/1010 (though no 
>> httpserver at all) and this 
>> https://github.com/cloudius-systems/osv/issues/536.
>>
>> Please note that the common thing between all these stack traces is 
>> fmt_fp() function in libc/stdio/vfprintf.c:300. Coincidence?
>>
>
> Interesting. Please open a new issue about it but link to #1010 and #536, 
> which I agree - are probably all exactly the same bug.
>
> Such crashes always smell like FPU-state-saving bugs but in this case 
> (which are notoriously difficult to fix), but I wonder... 
> The code in line 300 is:
>
>        do {
>                 *z = y;
>                 y = 1000000000*(y-*z++);
>         } while (y);
>   
>     
> Can you print "y" and "z" with the debugger?
>
>
(gdb) bt
#0  0x00000000003a83d2 in processor::cli_hlt () at arch/x64/processor.hh:247
#1  arch::halt_no_interrupts () at arch/x64/arch.hh:48
#2  osv::halt () at arch/x64/power.cc:24
#3  0x000000000023ef34 in abort (fmt=fmt@entry=0x63089b "Aborted\n") at 
runtime.cc:132
#4  0x0000000000202765 in abort () at runtime.cc:98
#5  0x0000000000346ce3 in mmu::vm_sigsegv (addr=<optimized out>, 
ef=0xffff800003431068) at core/mmu.cc:1316
#6  0x0000000000347947 in mmu::vm_fault (addr=addr@entry=35184374185984, 
ef=ef@entry=0xffff800003431068) at core/mmu.cc:1330
#7  0x00000000003a222c in page_fault (ef=0xffff800003431068) at 
arch/x64/mmu.cc:38
#8  <signal handler called>
#9  0x000000000044ef0f in fmt_fp (f=0x2000001fd710, y=0, w=0, p=6, fl=0, 
t=102) at libc/stdio/vfprintf.c:300
#10 0x0000000000000000 in ?? ()
(gdb) frame 9
#9  0x000000000044ef0f in fmt_fp (f=0x2000001fd710, y=0, w=0, p=6, fl=0, 
t=102) at libc/stdio/vfprintf.c:300
300 *z = y;
(gdb) display y
1: y = 0
(gdb) p/f y
$1 = 0
(gdb) p z
$2 = (uint32_t *) 0x200000200000
(gdb) p z - big // big is a beginning of the buffer that is around 1800 
long 
$3 = 4632
(gdb) p big
$4 = {0 <repeats 1828 times>}


The FPU bugs usually cause crashes when memory-copying code using FPU 
> overwrites random parts of memory. So why here do we always get a crash in 
> exactly the same place with this "z", and what overwrote it and when??? (z 
> is set just a couple of lines above and increased here in a tight loop, 
> what can overwrite it?
>
> Another possibility is that "y" is kept in a floating point register and 
> clobbered by some interrupt that doesn't save FPU state (or something like 
> that) which causes the loop to continue forever. 
> If you can reliably reproduce this, you can add various printouts or 
> variables to help debug what goes wrong with "z" or "y".
>
> What is interesting when I modified vfprintf.c like the I could not 
reproduce the error any more (or maybe less frequent) after around 20 runs 
of ffmpeg. Normally it happens every time in the scenario #1010.

diff --git a/libc/stdio/vfprintf.c b/libc/stdio/vfprintf.c
index aac790c0..1e116038 100644
--- a/libc/stdio/vfprintf.c
+++ b/libc/stdio/vfprintf.c
@@ -8,6 +8,7 @@
 #include <inttypes.h>
 #include <math.h>
 #include <float.h>
+#include <assert.h>
 
 /* Some useful macros */
 
@@ -296,9 +297,14 @@ static int fmt_fp(FILE *f, long double y, int w, int 
p, int fl, int t)
        if (e2<0) a=r=z=big;
        else a=r=z=big+sizeof(big)/sizeof(*big) - LDBL_MANT_DIG - 1;
 
+        int steps = 0;
        do {
                *z = y;
                y = 1000000000*(y-*z++);
+                steps++;
+               if(steps > 2000) {
+                  assert(0);
+                }
        } while (y);
 
        while (e2>0) {

Could it be because gcc optimized code differently and did not put "y" in 
the FPU registerr?

BTW I wonder if this part of some optimization suggestions in #536 is any 
relevant 
- https://github.com/cloudius-systems/osv/issues/536#issuecomment-62365543. 
Would it be possible to explicitly force gcc not to use FPU register for y? 
Would it degrade performance a lot? In ether case even if possible I am 
guessing the underlying issue of not saving FPU state somewhere could hurt 
us later in other place of the code anyway, correct? 
 
Finally when you say FPU state got corrupted or not saved/restored do you 
mean same thing? Can you elaborate?
I see this struct in processor.hh:
struct fpu_state {
    char legacy[512];
    char xsavehdr[24];
    char reserved[40];
    char ymm[256];
} __attribute__((packed));

Is this struct getting corrupted/overwritten in memory by some code or is 
it that some interrupt handler does not save/restore FPU using fpu_lock 
class like it is done by interrupt, page_fault handler, signal and syscall 
handler? Is this what you are saying difficult to trace?

Finally, I'm not familiar with this code - this can also be a musl bug - 
> see for example https://www.openwall.com/lists/musl/2014/03/22/7 which 
> says MUSL did have known bugs in this area. You should be able to fairly 
> easily port a new musl's vfprintf.c to OSv (do diff libc/stdio/vfprintf.c 
> musl/src/stdio/vfprintf.c to see what were the only changes OSv 
> deliberately did to musl's vfprintf.c).
>
>
>

-- 
You received this message because you are subscribed to the Google Groups "OSv 
Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
For more options, visit https://groups.google.com/d/optout.

Re: OSv crashes fairly sporadically with page fault when transcoding video with ffmpeg

Reply via email to