On Monday, October 5, 2020 at 4:35:49 PM UTC-4 Nadav Har'El wrote:

> On Mon, Oct 5, 2020 at 9:48 PM Waldek Kozaczuk <[email protected]> wrote:
>
>>
>> On Thursday, October 1, 2020 at 2:56:09 PM UTC-4 Nadav Har'El wrote:
>>
>>>
>>> I looked at the difference between the libc and musl implementation of 
>>> sigsetjmp.s, and there are extensive differences I don't understand.
>>> One of them is a difference between a "call" and a "jmp". One of them is 
>>> the use of "@PLT" in the code.
>>> Maybe that's an ABI (stack alignment) problem? Maybe linker?
>>> There's also a new bizarre __sigsetjmp_tail which I don't understand.
>>>
>>> Can you try switching those different functions one by one and perhaps 
>>> finding which one is causing the problem? 
>>>
>> I have narrowed down the tst-feexcept.cc to have only the following two 
>> assertions on (remaining code after in main is commented out) and also have 
>> changed env to be regular variable (see below) :
>>
>> static sigjmp_buf env;
>>
>
> I thought about it, thread_local is weird, but should have also worked 
> (and apparently, doesn't work exactly the same) -  the reason is that for 
> synchronic signals (sigfpe, sigsegv - things that happen because of bad 
> instructions, not signal from another thread), Osv runs the handler in the 
> same thread.
>
>
>> ...
>>
>> int main(int argc, char **argv)
>> {
>>     // Test that fegetexcept() does not return a negative number
>>     expect(fegetexcept() >= 0, true);
>>
>
> I guess you can drop this from the test too, and it will fail just the 
> same.
>
>
>>     // Test that *integer* division by zero generates, ironically, a 
>> SIGFPE
>>     expect(sig_check([] { printf("%d\n", 1 / zero_i()); }, SIGFPE), true);
>>
>>     std::cout << "SUMMARY: " << tests << " tests, " << fails << " 
>> failures\n";
>>     return fails == 0 ? 0 : 1;
>> }
>>
>> And the test still crashes (and 2 assertions pass). Obviously, if I keep 
>> the 1st assertion only, it does not crash. So something to do with 
>> sig_check().
>>
>> Now, I have discovered that if I comment out the invocation of f() in 
>> sig_check the test does not crash
>>
>
> The invocation of f() is what causes siglongjmp() (in the signal handler) 
> to be called.
> You can add a printout in the signal handler before siglongjmp() and see 
> that if it was reached, and then add a printout in if (sigsetjmp(env, 1)) { 
> (inside the if) - if that printout isn't shown, the siglongjmp didn't work.
>
The printouts in this "if" do show up so it is all very weird. As if 
somehow siglongjmp() ran one more time than necessary to re-run the same 
code again. Or maybe new sigsetjmp() is broken in a weird way where it 
jumps to continue in sig_check() and once main() of this unit test 
completes, the control is returned to the same original sigsetjmp() 
invocation (somewhere in that assembly code) where most of the variables 
like the original app are destroyed. I do not know what to make of it.

>  
>
>> but the 2nd assertion obviously fails.
>>
>> template<typename Func>
>> bool sig_check(Func f, int sig) {
>>     // save_except works around a bug in Linux glibc
>>     // (https://sourceware.org/bugzilla/show_bug.cgi?id=21035) where
>>     // longjmp resets the FPU exception mask. So if we want it to  survive
>>     // the longjmp in this test, we unfortunately need to save it 
>> ourselves
>>     int save_except = fegetexcept();
>>     if (sigsetjmp(env, 1)) {
>>         // got the signal (and automatically restored handler)
>>         fedisableexcept(FE_ALL_EXCEPT);
>>         feenableexcept(save_except);
>>         return true;
>>     }
>>     struct sigaction sa;
>>     sa.sa_flags = SA_RESETHAND;
>>     sigemptyset(&sa.sa_mask);
>>     sa.sa_handler = [] (int) {
>>         siglongjmp(env, 1);
>>     };
>>     assert(sigaction(sig, &sa, NULL) == 0);
>>     //f();
>>     sa.sa_handler = SIG_DFL;
>>     assert(sigaction(sig, &sa, NULL) == 0);
>>     fedisableexcept(FE_ALL_EXCEPT);
>>     feenableexcept(save_except);
>>     return false;
>> }
>>
>> So maybe something is wrong with sa_handler - siglongjmp(env, 1); - which 
>> when called for division by zero exception possibly wrote over the memory.
>>
>
> On a wild hunch, can you please try to compile this test without 
> optimization (-O0), maybe it changes something (as in 
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56982)? But I guess it's not 
> that, it's something else :-( 
>
> What do we actually benefit from taking Musl's new code and not keep the 
> old code, which worked well?
>
>From a strictly technical perspective, we do not know until we understand 
that musl commit message :-)
 
>From non-technical perspective, the main motivation is to get 
setjmp/longjmp along with their signal friends working on aarch64. The best 
and most logical way to accomplish it would be to point to the newest 
1.1.24 code in musl for aarch64 and at the same time for x64. That way we 
would get rid of even more OSv version/old musl copy of the code under 
libc/. If the new musl code does not work on x64 we can keep the old musl 
copies of siglongjmp.c and sigsetjmp.s under libc (siglongjmp.c is actually 
the same for both x64 and aarch64) and add the pre-
583e55122e767b1586286a0d9c35e2a4027998ab 
<https://git.musl-libc.org/cgit/musl/commit/?id=583e55122e767b1586286a0d9c35e2a4027998ab>
  
commit copy of ssigsetjmp.s for aarch64 under libc/ and be done with it. 
This unfortunately means we would have a mix of both new (1.1.24) and old 
musl code. But how often will that code change in musl and how 
significantly will it affect us?

At this moment we only use setjmp.s and longjmp.s from musl as-is for both 
x64 and aarch64. And use the following signal related code from musl:

musl += signal/killpg.o
musl += signal/siginterrupt.o
musl += signal/sigrtmin.o
musl += signal/sigrtmax.o
musl += string/strsignal.o 

What is the likelihood that once we upgrade to newer musl (1.2.1 for 
example), would we have come logical conflicts between the as-is musl code 
and "frozen-in-time" implementation of sigsetjmp/siglongjmp? 


> I think you can get rid of the thread-local and also have a chance to run 
>>> it on aarch64 then.
>>>
>> Getting rid of thread_local for env and making it static does not help on 
>> x64 - still exact same crash (which is good news in sense:-). Interestingly 
>> the same test does not crash on aarch64 now (no thread_local which is not 
>> supported yet) but some assertions fail. 
>>
>
> Yes - it shouldn't be thread_local but for synchronous signals (like 
> SIGFPE and SIGSEGV) it doesn't matter - the handler runs in the same thread.
>
It looks like you think that we can change both tst-feexcept.cc and 
tst-sigalstack.cc to replace thread local env variable with a regular 
static one. It will help with testing aarch64 until we get TLS working 
there (more about it in a separate email). I will send a separate patch.

-- 
You received this message because you are subscribed to the Google Groups "OSv 
Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/osv-dev/5d4cfef8-23de-4dee-9134-2c674b60841en%40googlegroups.com.

Reply via email to