On Thursday, October 1, 2020 at 2:56:09 PM UTC-4 Nadav Har'El wrote:
> On Thu, Oct 1, 2020 at 6:41 PM Waldek Kozaczuk <[email protected]> wrote:
>
>> I have been trying to implement the missing support for
>> sigsetjmp/siglongjmp on the aarch64 architecture, which (it looks) is
>> critical to getting the boost-based unit tests running. To that end, I
>> replaced the stubs of these functions in libc/arch/aarch64/setjmp/ and the
>> real implementations in libc/arch/x64/setjmp/ to point to the new musl
>> 1.1.24 code, which supports aarch64 as well (again, one more reason why I
>> did the musl upgrade).
>>
>> After this change, and after also implementing the functionality in
>> arch/aarch64/feexcept.cc (taken from newlib), the boost unit tests work
>> on aarch64.
>>
>> However, the problem is that two unit tests crash on x64:
>>
>> - tst-feexcept.cc
>> - tst-sigaltstack.so
>>
>> More specifically, all assertions actually pass, but the test apps crash
>> due to a zero-address page fault:
>>
>> ./scripts/run.py -e '/tests/tst-sigaltstack.so' -c 1
>> OSv v0.55.0-107-g81539853
>> eth0: 192.168.122.15
>> Booted up in 169.13 ms
>> Cmdline: /tests/tst-sigaltstack.so
>> SUMMARY: 6 tests, 0 failures
>> page fault outside application, addr: 0x0000000000000000
>> [registers]
>> RIP: 0x0000000040428884 <osv::application::run_main()+68>
>> RFL: 0x0000000000010202  CS:  0x0000000000000008  SS:  0x0000000000000010
>> RAX: 0x0000000000000000  RBX: 0x0000000000000000  RCX: 0x0000000000000001  RDX: 0x0000000000000000
>> RSI: 0x00001000000070ac  RDI: 0x0000200000700e80  RBP: 0x0000200000700f90  R8:  0xfffffffffffff958
>> R9:  0x000000004091bb58  R10: 0x0000000000000050  R11: 0xffffa0007ff77400  R12: 0x0000000000000001
>> R13: 0x000000004091bb58  R14: 0x0000000000000050  R15: 0xffffa0007ff77400  RSP: 0x0000200000700f60
>> Aborted
>>
>> [backtrace]
>> 0x000000004020556e <???+1075860846>
>> 0x000000004039268f <page_fault+143>
>> 0x0000000040391546 <???+1077482822>
>> 0x0000000040428a78 <???+1078102648>
>> 0x000000004045a1c5 <???+1078305221>
>> 0x00000000403f7a59 <thread_main_c+41>
>> 0x00000000403924c2 <???+1077486786>
>>
>> The gdb backtrace looks like this:
>>
>> (gdb) bt
>> #0  0x0000000040398e92 in processor::cli_hlt () at arch/x64/processor.hh:247
>> #1  arch::halt_no_interrupts () at arch/x64/arch.hh:48
>> #2  osv::halt () at arch/x64/power.cc:26
>> #3  0x0000000040238530 in abort (fmt=fmt@entry=0x4064451f "Aborted\n") at runtime.cc:132
>> #4  0x0000000040202ea6 in abort () at runtime.cc:98
>> #5  0x000000004020556f in mmu::vm_sigsegv (ef=0xffff800000ef9068, addr=0) at core/mmu.cc:1314
>> #6  mmu::vm_sigsegv (ef=0xffff800000ef9068, addr=0) at core/mmu.cc:1308
>> #7  mmu::vm_fault (addr=0, addr@entry=88, ef=ef@entry=0xffff800000ef9068) at core/mmu.cc:1336
>> #8  0x0000000040392690 in page_fault (ef=0xffff800000ef9068) at arch/x64/mmu.cc:42
>> #9  <signal handler called>
>> #10 0x0000000040428884 in osv::application::run_main (this=0x0) at /usr/include/c++/10/bits/stl_vector.h:918
>> #11 0x0000000040428a79 in operator() (app=<optimized out>, __closure=0x0) at core/app.cc:233
>> #12 _FUN () at core/app.cc:235
>> #13 0x000000004045a1c6 in operator() (__closure=0xffffa00000cf6800) at libc/pthread.cc:115
>> #14 std::__invoke_impl<void, pthread_private::pthread::pthread(void* (*)(void*), void*, sigset_t, const pthread_private::thread_attr*)::<lambda()>&> (__f=...) at /usr/include/c++/10/bits/invoke.h:60
>> #15 std::__invoke_r<void, pthread_private::pthread::pthread(void* (*)(void*), void*, sigset_t, const pthread_private::thread_attr*)::<lambda()>&> (__fn=...) at /usr/include/c++/10/bits/invoke.h:153
>> #16 std::_Function_handler<void(), pthread_private::pthread::pthread(void* (*)(void*), void*, sigset_t, const pthread_private::thread_attr*)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /usr/include/c++/10/bits/std_function.h:291
>> #17 0x00000000403f7a5a in sched::thread::main (this=0xffff800000ef4040) at core/sched.cc:1210
>> #18 sched::thread_main_c (t=0xffff800000ef4040) at arch/x64/arch-switch.hh:321
>> #19 0x00000000403924c3 in thread_main () at arch/x64/entry.S:113
>>
>> I think the this=0x0 in frame #10 is the culprit?
>
> It definitely looks bad. It's like we took a nullptr "app" and called
> app->run_main().
>
> The code that called it is probably application::start(), which calls
>
>     auto err = pthread_create(&_thread, NULL, [](void *app) -> void* {
>         ((application*)app)->main();
>         return nullptr;
>     }, this);
>
> which means that application->start() was called with a nullptr application.
>
> We have in app.cc a bunch of code that looks like
>
>     auto app = std::make_shared<application>(command, args, new_program, env,
>                                              main_function_name, post_main);
>     app->start();
>
> Could it be that make_shared failed the allocation and returned nullptr?
> You can add an assert in this code - or even in malloc - to fail
> immediately on allocation failure.
>
> All that being said, I don't understand why this happened *after* the test
> ran. Are we running another program after the test?
>
> Finally, the problem is *very likely* to have something to do with
> *sigsetjmp*, because this is what you said below that you changed, and
> these two specific tests use it. However, I have no idea why a broken
> setjmp caused what you saw here.
>
>> Did something write over this memory? BTW, the crash for tst-feexcept.cc
>> looks pretty much identical.
>>
>> Now, I have replaced two files in the Makefile from libc to musl and
>> added a 3rd musl file.
>> diff --git a/Makefile b/Makefile
>> index 9d2997a1..f7d740b0 100644
>> --- a/Makefile
>> +++ b/Makefile
>> @@ -1398,8 +1398,12 @@ musl += process/wait.o
>>
>>  musl += setjmp/$(musl_arch)/setjmp.o
>>  musl += setjmp/$(musl_arch)/longjmp.o
>> -libc += arch/$(arch)/setjmp/siglongjmp.o
>> -libc += arch/$(arch)/setjmp/sigsetjmp.o
>> +#libc += arch/$(arch)/setjmp/siglongjmp.o
>> +#libc += arch/$(arch)/setjmp/sigsetjmp.o
>> +musl += signal/$(musl_arch)/sigsetjmp.o
>
> I looked at the difference between the libc and musl implementations of
> sigsetjmp.s, and there are extensive differences I don't understand.
> One of them is a difference between a "call" and a "jmp". Another is the
> use of "@PLT" in the code.
> Maybe that's an ABI (stack alignment) problem? Maybe the linker?
> There's also a new, bizarre __sigsetjmp_tail which I don't understand.
>
> Can you try switching those different functions one by one and perhaps
> find which one is causing the problem?

I have narrowed down tst-feexcept.cc to only the following two assertions
(the remaining code after them in main is commented out), and I have also
changed env to be a regular variable (see below):

static sigjmp_buf env;
...
int main(int argc, char **argv)
{
    // Test that fegetexcept() does not return a negative number
    expect(fegetexcept() >= 0, true);
    // Test that *integer* division by zero generates, ironically, a SIGFPE
    expect(sig_check([] { printf("%d\n", 1 / zero_i()); }, SIGFPE), true);

    std::cout << "SUMMARY: " << tests << " tests, " << fails << " failures\n";
    return fails == 0 ? 0 : 1;
}

And the test still crashes (and the 2 assertions pass). Obviously, if I keep
only the 1st assertion, it does not crash. So it is something to do with
sig_check(). Now, I have discovered that if I comment out the invocation of
f() in sig_check, the test does not crash, but the 2nd assertion obviously
fails.
template<typename Func>
bool sig_check(Func f, int sig) {
    // save_except works around a bug in Linux glibc
    // (https://sourceware.org/bugzilla/show_bug.cgi?id=21035) where
    // longjmp resets the FPU exception mask. So if we want it to survive
    // the longjmp in this test, we unfortunately need to save it ourselves
    int save_except = fegetexcept();
    if (sigsetjmp(env, 1)) {
        // got the signal (and automatically restored handler)
        fedisableexcept(FE_ALL_EXCEPT);
        feenableexcept(save_except);
        return true;
    }
    struct sigaction sa;
    sa.sa_flags = SA_RESETHAND;
    sigemptyset(&sa.sa_mask);
    sa.sa_handler = [] (int) {
        siglongjmp(env, 1);
    };
    assert(sigaction(sig, &sa, NULL) == 0);
    //f();
    sa.sa_handler = SIG_DFL;
    assert(sigaction(sig, &sa, NULL) == 0);
    fedisableexcept(FE_ALL_EXCEPT);
    feenableexcept(save_except);
    return false;
}

So maybe something is wrong with the sa_handler - siglongjmp(env, 1); -
which, when invoked for the division-by-zero exception, possibly wrote over
the memory. I wonder if we have a bug in our sigprocmask (libc/signal.cc),
called by sigsetjmp(), that then somehow affects siglongjmp? Could it be
that, because of some changes in sigsetjmp, some destructors are called too
early and the memory is erased because of it (if that makes any sense)?

>> +musl += signal/siglongjmp.o
>> +musl += signal/sigsetjmp_tail.o
>> +$(out)/musl/src/signal/sigsetjmp_tail.o: CFLAGS += --include libc/syscall_to_function.h
>>  libc += arch/$(arch)/setjmp/block.o
>>  ifeq ($(arch),x64)
>>  libc += arch/$(arch)/ucontext/getcontext.o
>>
>> Now here is how the 2 files look in libc:
>>
>> cat libc/arch/x64/setjmp/siglongjmp.c
>>
>> #include <setjmp.h>
>>
>> extern void __restore_sigs(void *set);
>>
>> _Noreturn void siglongjmp(sigjmp_buf buf, int ret)
>> {
>>     if (buf->__fl) __restore_sigs(buf->__ss);
>>     longjmp(buf, ret);
>> }
>>
>> cat libc/arch/x64/setjmp/sigsetjmp.s
>>
>> /* Copyright 2011-2012 Nicholas J. Kain, licensed under standard MIT license */
>> .global sigsetjmp
>> .global __sigsetjmp
>> .type sigsetjmp,@function
>> .type __sigsetjmp,@function
>> __sigsetjmp:
>> sigsetjmp:
>>     andl %esi,%esi
>>     movq %rsi,64(%rdi)
>>     jz 1f
>>     pushq %rdi
>>     leaq 72(%rdi),%rdx
>>     xorl %esi,%esi
>>     movl $2,%edi
>>     call sigprocmask
>>     popq %rdi
>> 1:  jmp setjmp
>>
>> And here is how the 2 changed files and the 1 extra new file look in musl:
>>
>> cat musl/src/signal/siglongjmp.c
>>
>> #include <setjmp.h>
>> #include <signal.h>
>> #include "syscall.h"
>> #include "pthread_impl.h"
>>
>> _Noreturn void siglongjmp(sigjmp_buf buf, int ret)
>> {
>>     longjmp(buf, ret);
>> }
>>
>> cat musl/src/signal/x86_64/sigsetjmp.s
>>
>> .global sigsetjmp
>> .global __sigsetjmp
>> .type sigsetjmp,@function
>> .type __sigsetjmp,@function
>> sigsetjmp:
>> __sigsetjmp:
>>     test %esi,%esi
>>     jz 1f
>>
>>     popq 64(%rdi)
>>     mov %rbx,72+8(%rdi)
>>     mov %rdi,%rbx
>>
>>     call setjmp@PLT
>>
>>     pushq 64(%rbx)
>>     mov %rbx,%rdi
>>     mov %eax,%esi
>>     mov 72+8(%rbx),%rbx
>>
>>     .hidden __sigsetjmp_tail
>>     jmp __sigsetjmp_tail
>>
>> 1:  jmp setjmp@PLT
>>
>> cat musl/src/signal/sigsetjmp_tail.c
>>
>> #include <setjmp.h>
>> #include <signal.h>
>> #include "syscall.h"
>>
>> hidden int __sigsetjmp_tail(sigjmp_buf jb, int ret)
>> {
>>     void *p = jb->__ss;
>>     __syscall(SYS_rt_sigprocmask, SIG_SETMASK, ret?p:0, ret?0:p, _NSIG/8);
>>     return ret;
>> }
>>
>> Please note that we are actually replacing __syscall() in the last file
>> with a local function call to sigprocmask() using the --include
>> libc/syscall_to_function.h trick.
>>
>> As one can tell, the implementation of sigsetjmp has changed in musl, and
>> that is causing some issues on our side. Here is the link to the musl
>> commit that changed it -
>> https://git.musl-libc.org/cgit/musl/commit/?id=583e55122e767b1586286a0d9c35e2a4027998ab
>> ("redesign sigsetjmp so that signal mask is restored after longjmp").
>
> This commit has a very good explanation.
> It's just that it will take a lot of effort to understand it :-( From the
> few minutes I spent on it, I didn't understand :-(
>
>> I am not sure if I fully understand the change, but it looks like the
>> return address is stored in a different place. Maybe our use of a
>> thread-local variable for env causes some issues? Why are we using
>> thread-local variables in those 2 tests?
>
> Interesting guess, but you can easily verify it by making it
> non-thread-local.
> I don't know why I made this variable thread-local - I think it should
> have been "static".
> "thread-local" can allow running this test multiple times concurrently,
> but I don't see why anyone would want to do that...
>
> I think you can get rid of the thread-local and then also have a chance
> to run it on aarch64.

Getting rid of thread_local for env and making it static does not help on
x64 - still the exact same crash (which is good news, in a sense :-).
Interestingly, the same test does not crash on aarch64 now (no thread_local,
which is not supported yet), but some assertions fail.

>> Now, these two tests also fail on aarch64, but for a different reason -
>> we do not have TLS implemented on aarch64 - though I guess if we did,
>> those tests would fail in a similar way.
>>
>> Unfortunately, I do not understand signal handling in OSv and its
>> limitations, so I am not sure whether the changes to sigsetjmp on the
>> musl side are actually helpful to us and we should upgrade to the newest
>> version. Or maybe it would be better to keep the x64 version of it as is
>> and only use (copy) the pre-583e55122e767b1586286a0d9c35e2a4027998ab
>> version of the aarch64 sigsetjmp from musl instead, which would be sad.
>>
>> Any ideas?
>>
>> Waldek
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "OSv Development" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/osv-dev/bc3d65f7-f8d1-4bf6-afeb-76be861afa25n%40googlegroups.com

--
You received this message because you are subscribed to the Google Groups
"OSv Development" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to [email protected].
To view this discussion on the web visit
https://groups.google.com/d/msgid/osv-dev/ea62e258-b3d6-4744-aa63-034ba7d2023cn%40googlegroups.com.
