Upgrading vfprintf.c to the latest version from musl did not help. So back to square one.
On Monday, November 26, 2018 at 9:27:43 AM UTC-5, Nadav Har'El wrote: > > On Thu, Nov 22, 2018 at 9:55 PM Waldek Kozaczuk <[email protected] > <javascript:>> wrote: > >> I see this crash: >> >> frame= 275 fps= 47 q=-0.0 size= 520kB time=00:00:1pa2ge. f2a1 ublitt >> roauttes=i d3e4 8a.p6pklbiictast/iso ns,p eeadd=d2r200001d00000 0 0 >> [registers] >> RIP: 0x000000000044ef5f <???+4517727> >> RFL: 0x0000000000010202 CS: 0x0000000000000008 SS: 0x0000000000000010 >> RAX: 0x8000000000000000 RBX: 0x0000200001d00004 RCX: >> 0x0000000000000002 RDX: 0x0000200001cff66c >> RSI: 0xfffffffffffffffc RDI: 0x8000000000000000 RBP: >> 0x0000200001cff7a0 R8: 0x0000000000004000 >> R9: 0x00000000ffffffe5 R10: 0x0000000000004000 R11: >> 0x8000000000000000 R12: 0x0000000000000000 >> R13: 0x00000000ffffffe5 R14: 0x0000000000004000 R15: >> 0x8000000000000000 RSP: 0x0000200001cfda50 >> Aborted >> >> [backtrace] >> 0x0000000000346ce2 <???+3435746> >> 0x0000000000347946 <mmu::vm_fault(unsigned long, exception_frame*)+310> >> 0x00000000003a222b <page_fault+123> >> 0x00000000003a10a6 <???+3805350> >> >> Please note that ffmpeg is constantly printing to screen (vga or serial >> console?) some output about progress. >> >> Once connected to gdb I see this stacktrace: >> >> (gdb) bt >> #0 0x00000000003a83d2 in processor::cli_hlt () at >> arch/x64/processor.hh:247 >> #1 arch::halt_no_interrupts () at arch/x64/arch.hh:48 >> #2 osv::halt () at arch/x64/power.cc:24 >> #3 0x000000000023ef34 in abort (fmt=fmt@entry=0x63095b "Aborted\n") at >> runtime.cc:132 >> #4 0x0000000000202765 in abort () at runtime.cc:98 >> #5 0x0000000000346ce3 in mmu::vm_sigsegv (addr=<optimized out>, >> ef=0xffff800006550068) at core/mmu.cc:1316 >> #6 0x0000000000347947 in mmu::vm_fault (addr=addr@entry=35184402497536, >> ef=ef@entry=0xffff800006550068) at core/mmu.cc:1330 >> #7 0x00000000003a222c in page_fault (ef=0xffff800006550068) at >> arch/x64/mmu.cc:38 >> #8 <signal handler called> >> #9 0x000000000044ef5f in fmt_fp (f=0x200001cffa50, y=0, w=0, p=2, fl=0, >> t=102) at libc/stdio/vfprintf.c:300 >> #10 0x0000000000000000 in ?? () >> >> I wonder if this is related to >> https://github.com/cloudius-systems/osv/issues/1010 (though no >> httpserver at all) and this >> https://github.com/cloudius-systems/osv/issues/536. >> >> Please note that the common thing between all these stack traces is >> fmt_fp() function in libc/stdio/vfprintf.c:300. Coincidence? >> > > Interesting. Please open a new issue about it but link to #1010 and #536, > which I agree - are probably all exactly the same bug. > > Such crashes always smell like FPU-state-saving bugs but in this case > (which are notoriously difficult to fix), but I wonder... > The code in line 300 is: > > do { > *z = y; > y = 1000000000*(y-*z++); > } while (y); > > > Can you print "y" and "z" with the debugger? > > (gdb) bt #0 0x00000000003a83d2 in processor::cli_hlt () at arch/x64/processor.hh:247 #1 arch::halt_no_interrupts () at arch/x64/arch.hh:48 #2 osv::halt () at arch/x64/power.cc:24 #3 0x000000000023ef34 in abort (fmt=fmt@entry=0x63089b "Aborted\n") at runtime.cc:132 #4 0x0000000000202765 in abort () at runtime.cc:98 #5 0x0000000000346ce3 in mmu::vm_sigsegv (addr=<optimized out>, ef=0xffff800003431068) at core/mmu.cc:1316 #6 0x0000000000347947 in mmu::vm_fault (addr=addr@entry=35184374185984, ef=ef@entry=0xffff800003431068) at core/mmu.cc:1330 #7 0x00000000003a222c in page_fault (ef=0xffff800003431068) at arch/x64/mmu.cc:38 #8 <signal handler called> #9 0x000000000044ef0f in fmt_fp (f=0x2000001fd710, y=0, w=0, p=6, fl=0, t=102) at libc/stdio/vfprintf.c:300 #10 0x0000000000000000 in ?? () (gdb) frame 9 #9 0x000000000044ef0f in fmt_fp (f=0x2000001fd710, y=0, w=0, p=6, fl=0, t=102) at libc/stdio/vfprintf.c:300 300 *z = y; (gdb) display y 1: y = 0 (gdb) p/f y $1 = 0 (gdb) p z $2 = (uint32_t *) 0x200000200000 (gdb) p z - big // big is a beginning of the buffer that is around 1800 long $3 = 4632 (gdb) p big $4 = {0 <repeats 1828 times>} The FPU bugs usually cause crashes when memory-copying code using FPU > overwrites random parts of memory. So why here do we always get a crash in > exactly the same place with this "z", and what overwrote it and when??? (z > is set just a couple of lines above and increased here in a tight loop, > what can overwrite it? > > Another possibility is that "y" is kept in a floating point register and > clobbered by some interrupt that doesn't save FPU state (or something like > that) which causes the loop to continue forever. > If you can reliably reproduce this, you can add various printouts or > variables to help debug what goes wrong with "z" or "y". > > What is interesting when I modified vfprintf.c like the I could not reproduce the error any more (or maybe less frequent) after around 20 runs of ffmpeg. Normally it happens every time in the scenario #1010. diff --git a/libc/stdio/vfprintf.c b/libc/stdio/vfprintf.c index aac790c0..1e116038 100644 --- a/libc/stdio/vfprintf.c +++ b/libc/stdio/vfprintf.c @@ -8,6 +8,7 @@ #include <inttypes.h> #include <math.h> #include <float.h> +#include <assert.h> /* Some useful macros */ @@ -296,9 +297,14 @@ static int fmt_fp(FILE *f, long double y, int w, int p, int fl, int t) if (e2<0) a=r=z=big; else a=r=z=big+sizeof(big)/sizeof(*big) - LDBL_MANT_DIG - 1; + int steps = 0; do { *z = y; y = 1000000000*(y-*z++); + steps++; + if(steps > 2000) { + assert(0); + } } while (y); while (e2>0) { Could it be because gcc optimized code differently and did not put "y" in the FPU registerr? BTW I wonder if this part of some optimization suggestions in #536 is any relevant - https://github.com/cloudius-systems/osv/issues/536#issuecomment-62365543. Would it be possible to explicitly force gcc not to use FPU register for y? Would it degrade performance a lot? In ether case even if possible I am guessing the underlying issue of not saving FPU state somewhere could hurt us later in other place of the code anyway, correct? Finally when you say FPU state got corrupted or not saved/restored do you mean same thing? Can you elaborate? I see this struct in processor.hh: struct fpu_state { char legacy[512]; char xsavehdr[24]; char reserved[40]; char ymm[256]; } __attribute__((packed)); Is this struct getting corrupted/overwritten in memory by some code or is it that some interrupt handler does not save/restore FPU using fpu_lock class like it is done by interrupt, page_fault handler, signal and syscall handler? Is this what you are saying difficult to trace? Finally, I'm not familiar with this code - this can also be a musl bug - > see for example https://www.openwall.com/lists/musl/2014/03/22/7 which > says MUSL did have known bugs in this area. You should be able to fairly > easily port a new musl's vfprintf.c to OSv (do diff libc/stdio/vfprintf.c > musl/src/stdio/vfprintf.c to see what were the only changes OSv > deliberately did to musl's vfprintf.c). > > > -- You received this message because you are subscribed to the Google Groups "OSv Development" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. For more options, visit https://groups.google.com/d/optout.
