Here is the disassembled portion of the code in vfprint.c:
480 else a=r=z=big+sizeof(big)/sizeof(*big) - LDBL_MANT_DIG - 1;
481
482 do {
483 *z = y;
484 540: d9 c0 fld %st(0)
485 542: d9 ad 0c e3 ff ff fldcw -0x1cf4(%rbp)
486 548: df bd 00 e3 ff ff fistpll -0x1d00(%rbp)
487 54e: d9 ad 0e e3 ff ff fldcw -0x1cf2(%rbp)
488 y = 1000000000*(y-*z++);
489 554: 48 83 c3 04 add $0x4,%rbx
490 *z = y;
491 558: 48 8b 85 00 e3 ff ff mov -0x1d00(%rbp),%rax
492 55f: 89 43 fc mov %eax,-0x4(%rbx)
493 y = 1000000000*(y-*z++);
494 562: 89 c0 mov %eax,%eax
495 564: 48 89 85 f0 e2 ff ff mov %rax,-0x1d10(%rbp)
496 56b: df ad f0 e2 ff ff fildll -0x1d10(%rbp)
497 571: de e9 fsubrp %st,%st(1)
498 573: d8 c9 fmul %st(1),%st
499 } while (y);
500 575: d9 ee fldz
501 577: d9 c9 fxch %st(1)
502 579: db e9 fucomi %st(1),%st
503 57b: dd d9 fstp %st(1)
504 57d: 7a c1 jp 540 <fmt_fp+0x140>
505 57f: 75 bf jne 540 <fmt_fp+0x140>
506 581: dd d8 fstp %st(0)
507 583: dd d8 fstp %st(0)
508
509 while (e2>0) {
510 585: 4c 8b a5 e0 e2 ff ff mov -0x1d20(%rbp),%r12
Obviously it does use FP registers. I wonder if there is some kind of
arithmetic exception encountered (overflow when multiplying) which we do
not handle correctly. I think we have handler that would simply abort in
this case.
Waldek
On Tuesday, November 27, 2018 at 3:36:14 AM UTC-5, Nadav Har'El wrote:
>
>
> On Tue, Nov 27, 2018 at 6:52 AM Waldek Kozaczuk <[email protected]
> <javascript:>> wrote:
>
>> Upgrading vfprintf.c to the latest version from musl did not help. So
>> back to square one.
>>
>
> :-(
>
>
>>
>> (gdb) bt
>> #0 0x00000000003a83d2 in processor::cli_hlt () at
>> arch/x64/processor.hh:247
>> #1 arch::halt_no_interrupts () at arch/x64/arch.hh:48
>> #2 osv::halt () at arch/x64/power.cc:24
>> #3 0x000000000023ef34 in abort (fmt=fmt@entry=0x63089b "Aborted\n") at
>> runtime.cc:132
>> #4 0x0000000000202765 in abort () at runtime.cc:98
>> #5 0x0000000000346ce3 in mmu::vm_sigsegv (addr=<optimized out>,
>> ef=0xffff800003431068) at core/mmu.cc:1316
>> #6 0x0000000000347947 in mmu::vm_fault (addr=addr@entry=35184374185984,
>> ef=ef@entry=0xffff800003431068) at core/mmu.cc:1330
>> #7 0x00000000003a222c in page_fault (ef=0xffff800003431068) at
>> arch/x64/mmu.cc:38
>> #8 <signal handler called>
>> #9 0x000000000044ef0f in fmt_fp (f=0x2000001fd710, y=0, w=0, p=6, fl=0,
>> t=102) at libc/stdio/vfprintf.c:300
>> #10 0x0000000000000000 in ?? ()
>> (gdb) frame 9
>> #9 0x000000000044ef0f in fmt_fp (f=0x2000001fd710, y=0, w=0, p=6, fl=0,
>> t=102) at libc/stdio/vfprintf.c:300
>> 300 *z = y;
>> (gdb) display y
>> 1: y = 0
>>
>
> If y is really 0 (and not some sort of "almost 0 printed as 0", the loop
> should have stopped. But maybe it's also possible that gdb thinks it is 0
> because that's what the memory variable says, but the compiler actually
> didn't update the memory variable and only keeps y in a register.... You
> can disassemble the relevant part of this function to see what the compiled
> looks like.
>
> That being said, something I don't understand: even if y is corrupted
> once, I would expect it to perhaps double the amount this loop tries to
> write. It seems the loop ran much more than that. Could y have been
> corrupted more than once? Could y have reached some broken state which
> "looks" like zero but never evaluates to be equal to zero?
>
>
>
>> (gdb) p/f y
>> $1 = 0
>> (gdb) p z
>> $2 = (uint32_t *) 0x200000200000
>>
>
> Interesting. This is exactly the beginning of another huge page, so it
> makes sense we would get a page fault at this place, after already looping
> many more times than we should.
>
> (gdb) p z - big // big is a beginning of the buffer that is around 1800
>> long
>> $3 = 4632
>> (gdb) p big
>> $4 = {0 <repeats 1828 times>}
>>
>>
>> The FPU bugs usually cause crashes when memory-copying code using FPU
>>> overwrites random parts of memory. So why here do we always get a crash in
>>> exactly the same place with this "z", and what overwrote it and when??? (z
>>> is set just a couple of lines above and increased here in a tight loop,
>>> what can overwrite it?
>>>
>>> Another possibility is that "y" is kept in a floating point register and
>>> clobbered by some interrupt that doesn't save FPU state (or something like
>>> that) which causes the loop to continue forever.
>>> If you can reliably reproduce this, you can add various printouts or
>>> variables to help debug what goes wrong with "z" or "y".
>>>
>>> What is interesting when I modified vfprintf.c like the I could not
>> reproduce the error any more (or maybe less frequent) after around 20 runs
>> of ffmpeg. Normally it happens every time in the scenario #1010.
>>
>> diff --git a/libc/stdio/vfprintf.c b/libc/stdio/vfprintf.c
>> index aac790c0..1e116038 100644
>> --- a/libc/stdio/vfprintf.c
>> +++ b/libc/stdio/vfprintf.c
>> @@ -8,6 +8,7 @@
>> #include <inttypes.h>
>> #include <math.h>
>> #include <float.h>
>> +#include <assert.h>
>>
>> /* Some useful macros */
>>
>> @@ -296,9 +297,14 @@ static int fmt_fp(FILE *f, long double y, int w, int
>> p, int fl, int t)
>> if (e2<0) a=r=z=big;
>> else a=r=z=big+sizeof(big)/sizeof(*big) - LDBL_MANT_DIG - 1;
>>
>> + int steps = 0;
>> do {
>> *z = y;
>> y = 1000000000*(y-*z++);
>> + steps++;
>> + if(steps > 2000) {
>> + assert(0);
>>
>
> Because you used a function call here, the compiler is forced to save its
> FPU state... Can you try just "break" here and do the print or assert after
> the loop?
>
> + }
>> } while (y);
>>
>> while (e2>0) {
>>
>> Could it be because gcc optimized code differently and did not put "y" in
>> the FPU registerr?
>>
>> BTW I wonder if this part of some optimization suggestions in #536 is any
>> relevant -
>> https://github.com/cloudius-systems/osv/issues/536#issuecomment-62365543.
>> Would it be possible to explicitly force gcc not to use FPU register for y?
>> Would it degrade performance a lot? In ether case even if possible I am
>> guessing the underlying issue of not saving FPU state somewhere could hurt
>> us later in other place of the code anyway, correct?
>>
>
> I'm less worried about fixing this specific bug, and much more worried
> that we have a bigger bug which can crop up randomly in other workloads...
>
>
>>
>> Finally when you say FPU state got corrupted or not saved/restored do you
>> mean same thing? Can you elaborate?
>> I see this struct in processor.hh:
>> struct fpu_state {
>> char legacy[512];
>> char xsavehdr[24];
>> char reserved[40];
>> char ymm[256];
>> } __attribute__((packed));
>>
>> Is this struct getting corrupted/overwritten in memory by some code or is
>> it that some interrupt handler does not save/restore FPU using fpu_lock
>> class like it is done by interrupt, page_fault handler, signal and syscall
>> handler? Is this what you are saying difficult to trace?
>>
>
> The basic issue is that our kernel code uses the FPU - both for
> traditional FPU work (e.g., floating-point calculations in the scheduler,
> etc.) and for modern SIMD trickery generated by the compiler or
> hand-written by us (e.g., use some 128-bit FPU instruction to copy 128 bits
> at a time). As I mentioned above, when the compiler calls a function, it
> saves whatever FPU work it was in the middle of. But when something happens
> asynchronously - e.g, an interrupt, exception, etc., the code called
> (interrupt handler, etc.) needs to save the entire FPU state because it has
> no idea what the user's code might be in the middle of doing.
>
> We obviously have been doing this for a long time, but in some cases we
> forgot to save the FPU state or did it in the wrong time and this ets
> discovered and exposed by new compiler optimization which uses FPU in more
> and more unexpected places (e.g., see
> 4349bfd3583df03d44cda480049e630081d3a20b for one I fixed last year).
> Unfortunately it is possible we still have such bugs and they are hard to
> hunt down (the last bug I solved by reviewing disassembled code....).
>
> You're right that it's also possible that we did save the FPU state
> correctly, but then the interrupt handler or another thread corrupted the
> saved buffer, so when restoring it, we restore crap. I've never seen this
> happening, but it definitely can...
>
--
You received this message because you are subscribed to the Google Groups "OSv
Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
For more options, visit https://groups.google.com/d/optout.