On Thu, Oct 1, 2020 at 6:41 PM Waldek Kozaczuk <[email protected]> wrote:

> I have been trying to implement the missing support for sigsetjmp/siglongjmp
> on the aarch64 architecture, which (it looks) is critical to get boost-based
> unit tests running. To that end, I replaced the stubs of these functions in
> libc/arch/aarch64/setjmp/ and the real implementations
> in libc/arch/x64/setjmp/ to point to the new musl 1.1.24 code, which supports
> aarch64 as well (again, one more reason why I did the musl upgrade).
>
> After this change, and after also implementing the functionality in
> arch/aarch64/feexcept.cc (taken from newlib), the boost unit tests work
> on aarch64.
>
> However, the problem is that two unit tests crash on x64:
>
>    - tst-feexcept.cc
>    - tst-sigaltstack.so
>
> More specifically, all assertions actually pass, but the test apps crash
> due to a zero-address page fault:
>
> ./scripts/run.py -e '/tests/tst-sigaltstack.so' -c 1
> OSv v0.55.0-107-g81539853
> eth0: 192.168.122.15
> Booted up in 169.13 ms
> Cmdline: /tests/tst-sigaltstack.so
> SUMMARY: 6 tests, 0 failures
> page fault outside application, addr: 0x0000000000000000
> [registers]
> RIP: 0x0000000040428884 <osv::application::run_main()+68>
> RFL: 0x0000000000010202  CS:  0x0000000000000008  SS:  0x0000000000000010
> RAX: 0x0000000000000000  RBX: 0x0000000000000000  RCX: 0x0000000000000001
> RDX: 0x0000000000000000
> RSI: 0x00001000000070ac  RDI: 0x0000200000700e80  RBP: 0x0000200000700f90
> R8:  0xfffffffffffff958
> R9:  0x000000004091bb58  R10: 0x0000000000000050  R11: 0xffffa0007ff77400
> R12: 0x0000000000000001
> R13: 0x000000004091bb58  R14: 0x0000000000000050  R15: 0xffffa0007ff77400
> RSP: 0x0000200000700f60
> Aborted
>
> [backtrace]
> 0x000000004020556e <???+1075860846>
> 0x000000004039268f <page_fault+143>
> 0x0000000040391546 <???+1077482822>
> 0x0000000040428a78 <???+1078102648>
> 0x000000004045a1c5 <???+1078305221>
> 0x00000000403f7a59 <thread_main_c+41>
> 0x00000000403924c2 <???+1077486786>
>
> The gdb backtrace looks like this:
> (gdb) bt
> #0  0x0000000040398e92 in processor::cli_hlt () at
> arch/x64/processor.hh:247
> #1  arch::halt_no_interrupts () at arch/x64/arch.hh:48
> #2  osv::halt () at arch/x64/power.cc:26
> #3  0x0000000040238530 in abort (fmt=fmt@entry=0x4064451f "Aborted\n") at
> runtime.cc:132
> #4  0x0000000040202ea6 in abort () at runtime.cc:98
> #5  0x000000004020556f in mmu::vm_sigsegv (ef=0xffff800000ef9068, addr=0)
> at core/mmu.cc:1314
> #6  mmu::vm_sigsegv (ef=0xffff800000ef9068, addr=0) at core/mmu.cc:1308
> #7  mmu::vm_fault (addr=0, addr@entry=88, ef=ef@entry=0xffff800000ef9068)
> at core/mmu.cc:1336
> #8  0x0000000040392690 in page_fault (ef=0xffff800000ef9068) at
> arch/x64/mmu.cc:42
> #9  <signal handler called>
> #10 0x0000000040428884 in osv::application::run_main (this=0x0) at
> /usr/include/c++/10/bits/stl_vector.h:918
> #11 0x0000000040428a79 in operator() (app=<optimized out>, __closure=0x0)
> at core/app.cc:233
> #12 _FUN () at core/app.cc:235
> #13 0x000000004045a1c6 in operator() (__closure=0xffffa00000cf6800) at
> libc/pthread.cc:115
> #14 std::__invoke_impl<void, pthread_private::pthread::pthread(void*
> (*)(void*), void*, sigset_t, const
> pthread_private::thread_attr*)::<lambda()>&> (__f=...) at
> /usr/include/c++/10/bits/invoke.h:60
> #15 std::__invoke_r<void, pthread_private::pthread::pthread(void*
> (*)(void*), void*, sigset_t, const
> pthread_private::thread_attr*)::<lambda()>&> (__fn=...) at
> /usr/include/c++/10/bits/invoke.h:153
> #16 std::_Function_handler<void(), pthread_private::pthread::pthread(void*
> (*)(void*), void*, sigset_t, const
> pthread_private::thread_attr*)::<lambda()> >::_M_invoke(const
> std::_Any_data &) (
>     __functor=...) at /usr/include/c++/10/bits/std_function.h:291
> #17 0x00000000403f7a5a in sched::thread::main (this=0xffff800000ef4040) at
> core/sched.cc:1210
> #18 sched::thread_main_c (t=0xffff800000ef4040) at
> arch/x64/arch-switch.hh:321
> #19 0x00000000403924c3 in thread_main () at arch/x64/entry.S:113
>
> I think the this=0x0 in frame 10 is the culprit?
>

It definitely looks bad. It's like we took a nullptr "app" and called
app->run_main().

The code that called it is probably application::start(), which calls

    auto err = pthread_create(&_thread, NULL, [](void *app) -> void* {
        ((application*)app)->main();
        return nullptr;
    }, this);

which means that application::start() was called on a nullptr application.

We have in app.cc a bunch of code that looks like

    auto app = std::make_shared<application>(command, args, new_program, env,
                                             main_function_name, post_main);
    app->start();

Could it be that make_shared failed the allocation and returned nullptr?
You can add an assert in this code - or even in malloc - to fail
immediately on allocation failure.

All that being said, I don't understand why this happened *after* the test
ran. Are we running another program after the test?

Finally, the problem is *very likely* to have something to do with
*sigsetjmp*, because this is what you said below that you changed, and
these two specific tests use it. However, I have no idea why a broken
setjmp caused what you saw here.


> Did something write over this memory? BTW the crash for tst-feexcept.cc
> looks pretty much identical.
>
> Now, I have replaced two files in the Makefile from libc to musl and added
> a 3rd musl file:
> diff --git a/Makefile b/Makefile
> index 9d2997a1..f7d740b0 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -1398,8 +1398,12 @@ musl += process/wait.o
>
>  musl += setjmp/$(musl_arch)/setjmp.o
>  musl += setjmp/$(musl_arch)/longjmp.o
> -libc += arch/$(arch)/setjmp/siglongjmp.o
> -libc += arch/$(arch)/setjmp/sigsetjmp.o
> +#libc += arch/$(arch)/setjmp/siglongjmp.o
> +#libc += arch/$(arch)/setjmp/sigsetjmp.o
> +musl += signal/$(musl_arch)/sigsetjmp.o
>

I looked at the difference between the libc and musl implementations of
sigsetjmp.s, and there are extensive differences I don't understand.
One of them is a difference between a "call" and a "jmp". Another is
the use of "@PLT" in the code.
Maybe that's an ABI (stack alignment) problem? Maybe a linker issue?
There's also a new, bizarre __sigsetjmp_tail which I don't understand.

Can you try switching those different functions one by one and perhaps
finding which one is causing the problem?


> +musl += signal/siglongjmp.o
> +musl += signal/sigsetjmp_tail.o
> +$(out)/musl/src/signal/sigsetjmp_tail.o: CFLAGS += --include
> libc/syscall_to_function.h
>  libc += arch/$(arch)/setjmp/block.o
>  ifeq ($(arch),x64)
>  libc += arch/$(arch)/ucontext/getcontext.o
>
> Now here is how the 2 files look in libc:
> cat libc/arch/x64/setjmp/siglongjmp.c
> #include <setjmp.h>
>
> extern void __restore_sigs(void *set);
>
> _Noreturn void siglongjmp(sigjmp_buf buf, int ret)
> {
> if (buf->__fl) __restore_sigs(buf->__ss);
> longjmp(buf, ret);
> }
> cat libc/arch/x64/setjmp/sigsetjmp.s
> /* Copyright 2011-2012 Nicholas J. Kain, licensed under standard MIT
> license */
> .global sigsetjmp
> .global __sigsetjmp
> .type sigsetjmp,@function
> .type __sigsetjmp,@function
> __sigsetjmp:
> sigsetjmp:
> andl %esi,%esi
> movq %rsi,64(%rdi)
> jz 1f
> pushq %rdi
> leaq 72(%rdi),%rdx
> xorl %esi,%esi
> movl $2,%edi
> call sigprocmask
> popq %rdi
> 1: jmp setjmp
>
>
> And here how 2 changed and 1 extra new file looks in musl:
> cat musl/src/signal/siglongjmp.c
> #include <setjmp.h>
> #include <signal.h>
> #include "syscall.h"
> #include "pthread_impl.h"
>
> _Noreturn void siglongjmp(sigjmp_buf buf, int ret)
> {
> longjmp(buf, ret);
> }
> cat musl/src/signal/x86_64/sigsetjmp.s
> .global sigsetjmp
> .global __sigsetjmp
> .type sigsetjmp,@function
> .type __sigsetjmp,@function
> sigsetjmp:
> __sigsetjmp:
> test %esi,%esi
> jz 1f
>
> popq 64(%rdi)
> mov %rbx,72+8(%rdi)
> mov %rdi,%rbx
>
> call setjmp@PLT
>
> pushq 64(%rbx)
> mov %rbx,%rdi
> mov %eax,%esi
> mov 72+8(%rbx),%rbx
>
> .hidden __sigsetjmp_tail
> jmp __sigsetjmp_tail
>
> 1: jmp setjmp@PLT
> cat musl/src/signal/sigsetjmp_tail.c
> #include <setjmp.h>
> #include <signal.h>
> #include "syscall.h"
>
> hidden int __sigsetjmp_tail(sigjmp_buf jb, int ret)
> {
> void *p = jb->__ss;
> __syscall(SYS_rt_sigprocmask, SIG_SETMASK, ret?p:0, ret?0:p, _NSIG/8);
> return ret;
> }
>
> Please note that we are actually replacing __syscall() in the last file
> with a local function call to sigprocmask(), using the --include
> libc/syscall_to_function.h trick.
>
> As one can tell, the implementation of sigsetjmp has changed in musl and
> that is causing some issues on our side. Here is the link to the musl
> commit that changed it ("redesign sigsetjmp so that signal mask is
> restored after longjmp"):
> https://git.musl-libc.org/cgit/musl/commit/?id=583e55122e767b1586286a0d9c35e2a4027998ab
>

This commit has a very good explanation. It's just that it will take a lot
of effort to understand it :-( From the few minutes I spent on it, I didn't
understand it :-(
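
From what I could glean so far, the control flow changed roughly like this
(my reading of the commit and of the code you pasted - could well be wrong):

```
/* Old scheme (pre-583e551), roughly:
 *   sigsetjmp(jb, 1):  save the current mask into jb; tail-call setjmp
 *   siglongjmp(jb, r): restore the mask from jb; then longjmp(jb, r)
 *   -> window where a signal can be handled after the mask is restored
 *      but before the jump actually switches stacks.
 *
 * New scheme, roughly:
 *   sigsetjmp(jb, 1):  stash the return address + rbx in extra jb slots,
 *                      call setjmp, then jump to __sigsetjmp_tail
 *   __sigsetjmp_tail:  ret == 0 -> save the current mask into jb
 *                      ret != 0 -> restore the saved mask (we got here
 *                                  via longjmp, already on the old stack)
 *   siglongjmp(jb, r): just longjmp(jb, r)
 */
```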


> I am not sure if I fully understand the change, but it looks like the
> return address is stored in a different place. Maybe the fact that we use
> a thread-local variable for env causes some issues? Why are we using
> thread-local variables in those 2 tests?
>

Interesting guess, but you can easily verify it by making it
non-thread-local.
I don't know why I made this variable thread-local - I think it should have
been "static".
"thread-local" can allow running this test multiple times concurrently, but
I don't see why anyone would want to do that...

I think you can get rid of the thread-local and also have a chance to run
it on aarch64 then.


> Now, these two tests also fail on aarch64, but for a different reason - we
> do not have TLS implemented on aarch64 - though I guess if we did, those
> tests would fail in a similar way.
>
> Unfortunately, I do not understand signal handling in OSv and its
> limitations, so I am not sure if the changes to sigsetjmp on the musl side
> are actually helpful to us and whether we should upgrade to the newest
> version. Or maybe it would be better to keep the x64 version of it as is
> and only use (copy) the pre-583e55122e767b1586286a0d9c35e2a4027998ab
> version of the aarch64 sigsetjmp from musl instead, which would be sad.
>
> Any ideas?
>
> Waldek
>
> --
> You received this message because you are subscribed to the Google Groups
> "OSv Development" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/osv-dev/bc3d65f7-f8d1-4bf6-afeb-76be861afa25n%40googlegroups.com
> <https://groups.google.com/d/msgid/osv-dev/bc3d65f7-f8d1-4bf6-afeb-76be861afa25n%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>
