Forgot to reply to entire group. ---------- Forwarded message --------- From: Waldek Kozaczuk <[email protected]> Date: Sun, Dec 2, 2018 at 16:26 Subject: Re: OSv crashes fairly sporadically with page fault when transcoding video with ffmpeg To: Nadav Har'El <[email protected]>
I think I have good news but I am adding my comments inlined. On Sun, Dec 2, 2018 at 7:08 AM Nadav Har'El <[email protected]> wrote: > On Sat, Dec 1, 2018 at 6:26 PM Waldek Kozaczuk <[email protected]> > wrote: > >> Here we go again ... I was thinking more about it. >> >> First of regarding potential corruption of FPU state during involuntary >> interrupt task switch, unless my code that verifies it is not the case >> using the checksum logic (see my previous email with the patch), there >> could be another possibility - we are telling xsave/xstor to save/restore >> to/from wrong location in memory. Meaning that whatever got saved stays not >> corrupted but somehow we are restoring wrong data for whatever reason. >> Probably very unlikely. >> > > Yes, I think it is unlikely - since you are checking the checksum, it > means we are restoring valid saved state, but someone else's saved state. I > don't have an idea how this can happen. > > If we rule out corruption of the saved FPU state, I can think of three > remaining options: > > 1. This is not a saved-FPU-state corruption - printf() really goes into an > endless loop when it gets a certain input. I'm still hoping you can get > more detailed printouts on what exactly what printf()'s *input* when it > went into this endless loop. One thing you can do is to have a variable > (perhaps thread-local global if you don't want it optimized out?) holding > that printf_fp's last input, and when you get a crash, print this global. > Or something. > 2. This is a situation of some asynchronous even where we ran a handler - > especially the scheduler which uses the FPU - but we *forgot* to save the > FPU state. I don't know what this can be. This can perhaps be debugged with > tracepoints - enable various tracepoints, especially the scheduler, and fpu > saves/restores, and after the crash look at the tracepoint log. Can we find > unexpected scheduler calls without an fpu save? > > 3. This is a situation where we saved the FPU, but *after* doing some FPU > work. We had and fix this kind of issue in > 4349bfd3583df03d44cda480049e630081d3a20b but it's more surprising to see > this kind of with x87 registers, not SSE, I can't think of a reason why the > compiler will touch these registers without our intent. > > >> Secondly I think that part of our theory, that this famous loop in >> vfprintf loops infinitely until tries to access unmapped page of memory, is >> very likely correct. But it might happening for different reason. So far we >> thought that simply the number is NaN or corrupt that makes FPU Instruction >> calculate NAN forever. But what if for whatever reason we are experiencing >> FPU stack overflow - all registers are taken and any instruction trying to >> load data into it keeps failing because it is full. >> > > This is an interesting point. The eight x87 registers are freely > addressable (st0 through st7), but can *also* be used as a stack. The > current position is stored as three bits in the floating-point status word > (FSW), and the fxsave (or newer xsave...) saves this as well on involuntary > context switches. > Here is extra evidence that seems to validate at least some of the elements of my theory. When I add this to vfprintf: > diff --git a/libc/stdio/vfprintf.c b/libc/stdio/vfprintf.c > > index aac790c0..5cd31b3e 100644 > > --- a/libc/stdio/vfprintf.c > > +++ b/libc/stdio/vfprintf.c > > @@ -209,6 +209,9 @@ static char *fmt_u(uintmax_t x, char *s) > > > > static int fmt_fp(FILE *f, long double y, int w, int p, int fl, int t) > > { > > + uint16_t fpu_status; > > + asm volatile ("fnstsw %0" : "=m" (fpu_status)); > > + > > uint32_t big[(LDBL_MAX_EXP+LDBL_MANT_DIG)/9+1]; > > uint32_t *a, *d, *r, *z; > > int e2=0, e, i, j, l; > > @@ -654,8 +657,16 @@ static int printf_core(FILE *f, const char *fmt, >> va_list *ap, union arg *nl_arg, > > return 1; > > } > > > > +//The TLS version cannot be printed by GDB: > > +//(gdb) p vfprintf_fpu_status > > +// Cannot find thread-local storage for Thread 1, executable file >> build/release.x64/loader.elf: > > +// TLS not supported or disabled on this target > > +//static __thread uint16_t vfprintf_fpu_status = 0; > > +static uint16_t vfprintf_fpu_status = 0; > > int vfprintf(FILE *restrict f, const char *restrict fmt, va_list ap) > > { > > + asm volatile ("fnstsw %0" : "=m" (vfprintf_fpu_status)); > > + > > va_list ap2; > > int nl_type[NL_ARGMAX+1] = {0}; > > union arg nl_arg[NL_ARGMAX+1]; > > > and capture both variable (FPU status word) after crash: > (gdb) bt > > #0 0x00000000003a8542 in processor::cli_hlt () at >> arch/x64/processor.hh:247 > > #1 arch::halt_no_interrupts () at arch/x64/arch.hh:48 > > #2 osv::halt () at arch/x64/power.cc:24 > > #3 0x000000000023ef34 in abort (fmt=fmt@entry=0x63199b "Aborted\n") at >> runtime.cc:132 > > #4 0x0000000000202765 in abort () at runtime.cc:98 > > #5 0x0000000000346ce3 in mmu::vm_sigsegv (addr=<optimized out>, >> ef=0xffff800004b8f068) at core/mmu.cc:1316 > > #6 0x0000000000347947 in mmu::vm_fault (addr=addr@entry=35184374185984, >> ef=ef@entry=0xffff800004b8f068) at core/mmu.cc:1330 > > #7 0x00000000003a239c in page_fault (ef=0xffff800004b8f068) at >> arch/x64/mmu.cc:38 > > #8 <signal handler called> > > #9 0x000000000044ff77 in fmt_fp (f=0x2000001fd710, y=0, w=0, p=6, fl=0, >> t=102) at libc/stdio/vfprintf.c:303 > > #10 0x0000000000000000 in ?? () > > (gdb) frame 9 > > #9 0x000000000044ff77 in fmt_fp (f=0x2000001fd710, y=0, w=0, p=6, fl=0, >> t=102) at libc/stdio/vfprintf.c:303 > > 303 *z = y; > > (gdb) p/t fpu_status > > *$1 = 1100001* > > (gdb) p/t vfprintf_fpu_status > > *$2 = 11000001101001* > > > According to this http://www.ray.masmcode.com/tutorial/fpuchap1.htm#sword - FPU status word at the beginning of vfprint does not look anything clean. So something left it not clean before. > > The calling convention when calling a function (i.e., voluntary context > switch) is that this stack is popped. The same is true on return (except > st0 can be used for returning a floating-point value, if this is a function > returning a floating point value). I don't know what will happen if a > function violates this convention. > > >> I believe this exception is disabled but I can enable it to see if I am >> correct and it happens somewhere in the code. Now what if after voluntary >> or involuntary thread switch the FPU stack is full and the thread switched >> to simply does not know it. Whose responsibility is in this case to make >> sure that FPU is in clean state (empty)? >> > > When we switch *back* to an old thread, we should switch to its old FPU > stack, including its old position in the x87 stack, so if it's full, it's > this thread's fault. > But on a *new* thread, we do need to make sure it got a clean slate. We > had a bug in this long ago, which 0e8d9b5e0a1af02a31799fe1d40e434bc2c2a5fb > fixed - thread_main_c() now calls processor::init_fpu(). This calls the > FNINIT instruction, which among other things, resets the stack postion (in > FSW) to 0. > > But now that I think about this, in case of an involuntary switch to the > kernel (e.g., interrupt) what FPU state does the kernel code get? What if > the kernel gets a full fpu stack and then can't do its own FPU > calculations? Should fpu_lock() run fpu_init() after saving the FPU state? > In any case, I don't think this has anything to do with the problem at > hand. I'll open a separate issue. > > >> Maybe the involuntary switch has/should have two sub-cases: >> 1) the thread we are switching to before was also involuntarily switched >> and we need to restore its FPU state from interrupt stack which we do >> 2) the thread we are switching to voluntarily yielded and ... I think FPU >> should be in somewhat clean state, at least be empty unless something left >> it full before, right? >> > > I think you may have indeed found a problem (I'm not sure it's related to > the bug we're chasing, but a problem nontheless, and *may* be related): > > We already have these two cases, implicitly: As I explained in a previous > mail, when we context switch *back* to a thread, we get back to where it > was when it gave up the CPU: > 1. If this was an involuntary context switch, we find ourselves returning > (very belatedly) from the scheduler to the end of the interrupt() function, > and now this function ends and pops the saved fpu (from the thread's > interrupt stack), and finally returns from interrupt() to the original > thread. > 2. If this was a voluntary context switch, the scheduler returns into > whatever function which called it (sleep(), read(), etc.) and that returns > to. The floating point state was never saved (it doesn't need to be, > because a function was called) but, on return, it was never restored. It > just gets a copy of whatever junk was there from the previous thread or OSv > code that ran. The junk st0..st7 registers are fine - the caller cannot > count on anything in there too. But the control registers might now have > junk as well, and that OSv does have to restore, and that it does since > commit 0959f20863b119c5ca7d1e663bc22e6329814386. However, we will never > restore the floating point *status* word (FSW). So if one thread was > interrupted with a full fp stack, another thread we switch to might be > returned-to with a full stack. > > *You can try adding an FNINIT in arch-switch.hh *before* > processor::fldcw(fpucw) and see if it fixes anything.* > > Indeed this what I did: > diff --git a/arch/x64/arch-switch.hh b/arch/x64/arch-switch.hh > > index e3769540..7cd8d181 100644 > > --- a/arch/x64/arch-switch.hh > > +++ b/arch/x64/arch-switch.hh > > @@ -103,6 +103,7 @@ void thread::switch_to() > > [rip]"i"(offsetof(thread_state, rip)) > > : "rbx", "rdx", "rsi", "rdi", "r8", "r9", > > "r10", "r11", "r12", "r13", "r14", "r15", "memory"); > > *+ asm volatile("fninit");* > > processor::fldcw(fpucw); > > processor::ldmxcsr(mxcsr); > > } > > > *And at least the issue described in #1010 went away based on around 30-40 executions which used to fail almost every time. So I think is very good news !!! * By the way, to be honest, all this discussion is about x87. I don't know > what is the modern SSE equivalent :-( > > Finally, it may indeed make sense to enable fpu stack overflow exceptions > (and their SSE equivalent??) to see if we have any of these problems. > > > > > >> My thinking might be all wrong. But just food for thought. >> >> Waldek >> >> Sent from my iPhone >> >> On Dec 1, 2018, at 00:46, Waldek Kozaczuk <[email protected]> wrote: >> >> My theory has fallen apart: I modified httpserver like this to pass >> constant float value of 0.5 and still getting same crash: >> >> index 358a86c6..126ab5f4 100644 >> --- a/modules/httpserver-api/api/os.cc >> +++ b/modules/httpserver-api/api/os.cc >> @@ -131,7 +131,7 @@ void init(routes& routes) >> thread.migrations = t.stat_migrations.get(); >> thread.preemptions = t.stat_preemptions.get(); >> thread.name = t.name(); >> - thread.priority = t.priority(); >> + thread.priority = 1.0;//t.priority(); >> thread.stack_size = t.get_stack_info().size; >> thread.status = t.get_status(); >> threads.list.push(thread); >> >> I even created separate test program that keeps sprintf'ing both NAN and >> regular float numbers using 5 threads to replace ffmpeg and I could not >> replicate same program running exact same scenario with httpserver. >> >> As I wrote in other thread I added checksum logic to detect FPU state >> corruption that unfortunately does not detect anything. But is this logic >> actually correct? >> >> Ediff --git a/arch/x64/arch-cpu.cc b/arch/x64/arch-cpu.cc >> index bb96af3d..34bf6081 100644 >> --- a/arch/x64/arch-cpu.cc >> +++ b/arch/x64/arch-cpu.cc >> @@ -42,6 +42,14 @@ exception_guard::~exception_guard() >> sched::cpu::current()->arch.exit_exception(); >> } >> >> +static inline u64 calculate_fpu_state_checksum(processor::fpu_state *s) { >> + u64 val = 0; >> + char* data = s->legacy; >> + for (u64 i = 0; i < sizeof(processor::fpu_state) - sizeof(u64); i++) >> + val += data[i]; >> + return val; >> +} >> + >> extern "C" >> [[gnu::target("no-sse")]] >> void fpu_state_init_xsave(processor::fpu_state *s) { >> @@ -61,21 +69,25 @@ void fpu_state_init_fxsave(processor::fpu_state *s) { >> extern "C" >> void fpu_state_save_xsave(processor::fpu_state *s) { >> processor::xsave(s, -1); >> + s->checksum = calculate_fpu_state_checksum(s); >> } >> >> extern "C" >> void fpu_state_save_fxsave(processor::fpu_state *s) { >> processor::fxsave(s); >> + s->checksum = calculate_fpu_state_checksum(s); >> } >> >> extern "C" >> void fpu_state_restore_xsave(processor::fpu_state *s) { >> processor::xrstor(s, -1); >> + assert(s->checksum == calculate_fpu_state_checksum(s)); >> } >> >> extern "C" >> void fpu_state_restore_fxsave(processor::fpu_state *s) { >> processor::fxrstor(s); >> + assert(s->checksum == calculate_fpu_state_checksum(s)); >> } >> >> extern "C" >> diff --git a/arch/x64/processor.hh b/arch/x64/processor.hh >> index 6b1fbf66..f699d5b8 100644 >> --- a/arch/x64/processor.hh >> +++ b/arch/x64/processor.hh >> @@ -343,6 +343,7 @@ struct fpu_state { >> char xsavehdr[24]; >> char reserved[40]; >> char ymm[256]; >> + u64 checksum; >> } __attribute__((packed)); >> >> inline void fxsave(fpu_state* s) >> >> On Friday, November 30, 2018 at 7:43:31 PM UTC-5, Nadav Har'El wrote: >>> >>> On Thu, Nov 29, 2018 at 9:33 PM Waldek Kozaczuk <[email protected]> >>> wrote: >>> >>>> I wonder if per this - https://software.intel.com/en-us/node/523375 - >>>> we are enabling correct bits for xsave if caste is what being used here. >>>> >>>> “The second operand is a save/restore mask specifying the >>>> saved/restored extended states. The value of the mask is ANDed with >>>> XFEATURE_ENABLED_MASK(XCR0). A particular extended state is saved/restored >>>> only if the corresponding bit of both save/restore mask and >>>> XFEATURE_ENABLED_MASK is set to '1'.” >>>> >>> >>> Interesting question. In arch/x64/arch-cpu.hh, arch_cpu::init_on_cpu(), >>> you can see: >>> >>> if (features().xsave) { >>> auto bits = xcr0_x87 | xcr0_sse; >>> if (features().avx) { >>> bits |= xcr0_avx; >>> } >>> write_xcr(xcr0, bits); >>> } >>> >>> So we're turning on the "x87", "sse", and "avx" bits. Are there any >>> other bits we need to set (or at least, not zero with this >>> overwrite-all-bits code)? >>> >>> -- >> You received this message because you are subscribed to a topic in the >> Google Groups "OSv Development" group. >> To unsubscribe from this topic, visit >> https://groups.google.com/d/topic/osv-dev/ZAjvi_vEhnA/unsubscribe. >> To unsubscribe from this group and all its topics, send an email to >> [email protected]. >> For more options, visit https://groups.google.com/d/optout. >> >> -- >> You received this message because you are subscribed to the Google Groups >> "OSv Development" group. >> To unsubscribe from this group and stop receiving emails from it, send an >> email to [email protected]. >> For more options, visit https://groups.google.com/d/optout. >> > -- You received this message because you are subscribed to the Google Groups "OSv Development" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. For more options, visit https://groups.google.com/d/optout.
