Also I wonder if the musl sources are compiled correctly as far as floating point handling is concerned - https://wiki.musl-libc.org/mathematical-library.html.
Sent from my iPhone > On Nov 27, 2018, at 13:36, Waldek Kozaczuk <[email protected]> wrote: > > Here is the disassembled portion of the code in vfprint.c: > 480 else a=r=z=big+sizeof(big)/sizeof(*big) - LDBL_MANT_DIG - 1; > 481 > 482 do { > 483 *z = y; > 484 540: d9 c0 fld %st(0) > 485 542: d9 ad 0c e3 ff ff fldcw -0x1cf4(%rbp) > 486 548: df bd 00 e3 ff ff fistpll -0x1d00(%rbp) > 487 54e: d9 ad 0e e3 ff ff fldcw -0x1cf2(%rbp) > 488 y = 1000000000*(y-*z++); > 489 554: 48 83 c3 04 add $0x4,%rbx > 490 *z = y; > 491 558: 48 8b 85 00 e3 ff ff mov -0x1d00(%rbp),%rax > 492 55f: 89 43 fc mov %eax,-0x4(%rbx) > 493 y = 1000000000*(y-*z++); > 494 562: 89 c0 mov %eax,%eax > 495 564: 48 89 85 f0 e2 ff ff mov %rax,-0x1d10(%rbp) > 496 56b: df ad f0 e2 ff ff fildll -0x1d10(%rbp) > 497 571: de e9 fsubrp %st,%st(1) > 498 573: d8 c9 fmul %st(1),%st > 499 } while (y); > 500 575: d9 ee fldz > 501 577: d9 c9 fxch %st(1) > 502 579: db e9 fucomi %st(1),%st > 503 57b: dd d9 fstp %st(1) > 504 57d: 7a c1 jp 540 <fmt_fp+0x140> > 505 57f: 75 bf jne 540 <fmt_fp+0x140> > 506 581: dd d8 fstp %st(0) > 507 583: dd d8 fstp %st(0) > 508 > 509 while (e2>0) { > 510 585: 4c 8b a5 e0 e2 ff ff mov -0x1d20(%rbp),%r12 > > Obviously it does use FP registers. I wonder if there is some kind of > arithmetic exception encountered (overflow when multiplying) which we do not > handle correctly. I think we have handler that would simply abort in this > case. > > Waldek > >> On Tuesday, November 27, 2018 at 3:36:14 AM UTC-5, Nadav Har'El wrote: >> >>> On Tue, Nov 27, 2018 at 6:52 AM Waldek Kozaczuk <[email protected]> wrote: >>> Upgrading vfprintf.c to the latest version from musl did not help. So back >>> to square one. >> >> :-( >> >>> >>> >>> (gdb) bt >>> #0 0x00000000003a83d2 in processor::cli_hlt () at arch/x64/processor.hh:247 >>> #1 arch::halt_no_interrupts () at arch/x64/arch.hh:48 >>> #2 osv::halt () at arch/x64/power.cc:24 >>> #3 0x000000000023ef34 in abort (fmt=fmt@entry=0x63089b "Aborted\n") at >>> runtime.cc:132 >>> #4 0x0000000000202765 in abort () at runtime.cc:98 >>> #5 0x0000000000346ce3 in mmu::vm_sigsegv (addr=<optimized out>, >>> ef=0xffff800003431068) at core/mmu.cc:1316 >>> #6 0x0000000000347947 in mmu::vm_fault (addr=addr@entry=35184374185984, >>> ef=ef@entry=0xffff800003431068) at core/mmu.cc:1330 >>> #7 0x00000000003a222c in page_fault (ef=0xffff800003431068) at >>> arch/x64/mmu.cc:38 >>> #8 <signal handler called> >>> #9 0x000000000044ef0f in fmt_fp (f=0x2000001fd710, y=0, w=0, p=6, fl=0, >>> t=102) at libc/stdio/vfprintf.c:300 >>> #10 0x0000000000000000 in ?? () >>> (gdb) frame 9 >>> #9 0x000000000044ef0f in fmt_fp (f=0x2000001fd710, y=0, w=0, p=6, fl=0, >>> t=102) at libc/stdio/vfprintf.c:300 >>> 300 *z = y; >>> (gdb) display y >>> 1: y = 0 >> >> If y is really 0 (and not some sort of "almost 0 printed as 0", the loop >> should have stopped. But maybe it's also possible that gdb thinks it is 0 >> because that's what the memory variable says, but the compiler actually >> didn't update the memory variable and only keeps y in a register.... You can >> disassemble the relevant part of this function to see what the compiled >> looks like. >> >> That being said, something I don't understand: even if y is corrupted once, >> I would expect it to perhaps double the amount this loop tries to write. It >> seems the loop ran much more than that. Could y have been corrupted more >> than once? Could y have reached some broken state which "looks" like zero >> but never evaluates to be equal to zero? >> >> >>> (gdb) p/f y >>> $1 = 0 >>> (gdb) p z >>> $2 = (uint32_t *) 0x200000200000 >> >> Interesting. This is exactly the beginning of another huge page, so it makes >> sense we would get a page fault at this place, after already looping many >> more times than we should. >> >>> (gdb) p z - big // big is a beginning of the buffer that is around 1800 >>> long >>> $3 = 4632 >>> (gdb) p big >>> $4 = {0 <repeats 1828 times>} >>> >>> >>>> The FPU bugs usually cause crashes when memory-copying code using FPU >>>> overwrites random parts of memory. So why here do we always get a crash in >>>> exactly the same place with this "z", and what overwrote it and when??? (z >>>> is set just a couple of lines above and increased here in a tight loop, >>>> what can overwrite it? >>>> >>>> Another possibility is that "y" is kept in a floating point register and >>>> clobbered by some interrupt that doesn't save FPU state (or something like >>>> that) which causes the loop to continue forever. >>>> If you can reliably reproduce this, you can add various printouts or >>>> variables to help debug what goes wrong with "z" or "y". >>>> >>> What is interesting when I modified vfprintf.c like the I could not >>> reproduce the error any more (or maybe less frequent) after around 20 runs >>> of ffmpeg. Normally it happens every time in the scenario #1010. >>> >>> diff --git a/libc/stdio/vfprintf.c b/libc/stdio/vfprintf.c >>> index aac790c0..1e116038 100644 >>> --- a/libc/stdio/vfprintf.c >>> +++ b/libc/stdio/vfprintf.c >>> @@ -8,6 +8,7 @@ >>> #include <inttypes.h> >>> #include <math.h> >>> #include <float.h> >>> +#include <assert.h> >>> >>> /* Some useful macros */ >>> >>> @@ -296,9 +297,14 @@ static int fmt_fp(FILE *f, long double y, int w, int >>> p, int fl, int t) >>> if (e2<0) a=r=z=big; >>> else a=r=z=big+sizeof(big)/sizeof(*big) - LDBL_MANT_DIG - 1; >>> >>> + int steps = 0; >>> do { >>> *z = y; >>> y = 1000000000*(y-*z++); >>> + steps++; >>> + if(steps > 2000) { >>> + assert(0); >> >> Because you used a function call here, the compiler is forced to save its >> FPU state... Can you try just "break" here and do the print or assert after >> the loop? >> >>> + } >>> } while (y); >>> >>> while (e2>0) { >>> >>> Could it be because gcc optimized code differently and did not put "y" in >>> the FPU registerr? >>> >>> BTW I wonder if this part of some optimization suggestions in #536 is any >>> relevant - >>> https://github.com/cloudius-systems/osv/issues/536#issuecomment-62365543. >>> Would it be possible to explicitly force gcc not to use FPU register for y? >>> Would it degrade performance a lot? In ether case even if possible I am >>> guessing the underlying issue of not saving FPU state somewhere could hurt >>> us later in other place of the code anyway, correct? >> >> I'm less worried about fixing this specific bug, and much more worried that >> we have a bigger bug which can crop up randomly in other workloads... >> >>> >>> Finally when you say FPU state got corrupted or not saved/restored do you >>> mean same thing? Can you elaborate? >>> I see this struct in processor.hh: >>> struct fpu_state { >>> char legacy[512]; >>> char xsavehdr[24]; >>> char reserved[40]; >>> char ymm[256]; >>> } __attribute__((packed)); >>> >>> Is this struct getting corrupted/overwritten in memory by some code or is >>> it that some interrupt handler does not save/restore FPU using fpu_lock >>> class like it is done by interrupt, page_fault handler, signal and syscall >>> handler? Is this what you are saying difficult to trace? >> >> The basic issue is that our kernel code uses the FPU - both for traditional >> FPU work (e.g., floating-point calculations in the scheduler, etc.) and for >> modern SIMD trickery generated by the compiler or hand-written by us (e.g., >> use some 128-bit FPU instruction to copy 128 bits at a time). As I >> mentioned above, when the compiler calls a function, it saves whatever FPU >> work it was in the middle of. But when something happens asynchronously - >> e.g, an interrupt, exception, etc., the code called (interrupt handler, >> etc.) needs to save the entire FPU state because it has no idea what the >> user's code might be in the middle of doing. >> >> We obviously have been doing this for a long time, but in some cases we >> forgot to save the FPU state or did it in the wrong time and this ets >> discovered and exposed by new compiler optimization which uses FPU in more >> and more unexpected places (e.g., see >> 4349bfd3583df03d44cda480049e630081d3a20b for one I fixed last year). >> Unfortunately it is possible we still have such bugs and they are hard to >> hunt down (the last bug I solved by reviewing disassembled code....). >> >> You're right that it's also possible that we did save the FPU state >> correctly, but then the interrupt handler or another thread corrupted the >> saved buffer, so when restoring it, we restore crap. I've never seen this >> happening, but it definitely can... > > -- > You received this message because you are subscribed to a topic in the Google > Groups "OSv Development" group. > To unsubscribe from this topic, visit > https://groups.google.com/d/topic/osv-dev/ZAjvi_vEhnA/unsubscribe. > To unsubscribe from this group and all its topics, send an email to > [email protected]. > For more options, visit https://groups.google.com/d/optout. -- You received this message because you are subscribed to the Google Groups "OSv Development" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. For more options, visit https://groups.google.com/d/optout.
