[Bug tree-optimization/114872] [13/14/15 Regression] Miscompilation with -O2 after commit r13-8037
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114872 --- Comment #28 from Jakub Jelinek --- (In reply to Dmitrii Pasechnik from comment #26) > We have megabytes of code calling libraries where setjmp/longjmp is used, > in github.com/sagemath/sage/ (most of it in Cython). > > It looks like a huge hassle to put volatiles there. > Looks like we would need ways to disable particular optimisations which lead > to these sorts of errors:-( You mean turn off all optimizations then? -O0. Even that doesn't guarantee it will work if some such variables are declared with register keyword. It isn't just POSIX which says this, e.g. C99 also says: "All accessible objects have values, and all other components of the abstract machine have state, as of the time the longjmp function was called, except that the values of objects of automatic storage duration that are local to the function containing the invocation of the corresponding setjmp macro that do not have volatile-qualified type and have been changed between the setjmp invocation and longjmp call are indeterminate." C89 said pretty much the same.
[Bug tree-optimization/114872] [13/14/15 Regression] Miscompilation with -O2 after commit r13-8037
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114872 Andrew Pinski changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED See Also||https://gcc.gnu.org/bugzill ||a/show_bug.cgi?id=56512 Resolution|--- |INVALID --- Comment #27 from Andrew Pinski --- https://pubs.opengroup.org/onlinepubs/7908799/xsh/longjmp.html The requirement of using volatile has been there a long time before your code was written ...
[Bug tree-optimization/114872] [13/14/15 Regression] Miscompilation with -O2 after commit r13-8037
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114872 --- Comment #26 from Dmitrii Pasechnik --- We have megabytes of code calling libraries where setjmp/longjmp is used, in github.com/sagemath/sage/ (most of it in Cython). It looks like a huge hassle to put volatiles there. Looks like we would need ways to disable particular optimisations which lead to these sorts of errors:-(
[Bug tree-optimization/114872] [13/14/15 Regression] Miscompilation with -O2 after commit r13-8037
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114872 --- Comment #25 from Sergei Trofimovich --- (In reply to Richard Biener from comment #24) > (In reply to Sergei Trofimovich from comment #23) > [...] > > Why did `gcc` generate unconditional NULL dereference here? I suspect it > > somehow inferred that `__pyx_t_6 = NULL;` in that branch, but not before > > comparison. > > That's what happens if we isolate an unreachable path because of a NULL > dereference (like if exposed by jump-threading). We make the NULL > dereference volatile so it stays but DCE/DSE can cleanup code on the path > leading to it. > > If you run into such path the this might suggest that jump-threading > triggered > a problem with the setjmp/longjmp, so it's then likely some condition that's > evaluated in a wrong way after the longjmp, either because a dependent > value wasn't properly preserved or by GCC breaking that. Seeing stack memory > arguments used on a call in a previous comment Yeah, that makes sense. Having stared a bit more at __pyx_pf_4sage_4libs_3gap_7element_19GapElement_Function_2__call__() I think I get the problem now. We deal with the code similar to the following: __pyx_pf_4sage_4libs_3gap_7element_19GapElement_Function_2__call__() { __pyx_t_6 = NULL; // the loop is not very important, but it forces `__pyx_t_6` initialization before `setjmp()` for (;;) { __pyx_t_6 = something(); int done = use_pre_setjmp(__pyx_t_6); __pyx_t_6 = NULL; if (done) break; } int mode = setjmp(jb); switch (mode) { case 1: // longjmp() case break; case 0: // regular case (`case 3:` in real code) __pyx_t_6 = something_else(); // set __pyx_t_6 to non-zero int done = use_post_setjmp(__pyx_t_6); // call longjmp(jb, 1) here __pyx_t_6 = NULL; break; } // get here via longjmp() if (__pyx_t_6 != NULL) deref(__pyx_t_6); } AFAIU `gcc` is smart enough to see that all paths to `deref()` reach with `__pyx_t_6 = NULL`, but it does not eliminate the `deref()` entirely and uses `if (__pyx_t_6 != NULL) deref(NULL);` as Richard explained above. Now due to `longjmp()` `__pyx_t_6 = NULL;` does not get executed (even though it's present in assembly code as `movq $0, -200(%rbp)` in all the places where it's present in C code. As a result after the `longjmp()` `__pyx_t_6` is not `NULL` and we get to `deref(NNULL)` and SIGSEGV. Thus it's a matter of missing `volatile __pyx_t_6`. Sounds about right? > I wondered if POSIX suggests > that even non-register variables need to be made volatile and thus whether > SRA or FRE might impose problems with code using setjmp/longjmp. That matches my understanding as well. Would it be fair to say that sprinkling `volatile` has to be done for every single local variable in the function to prevent possible stack reuse? And that rule should extend to the functions that could host an inline variant of the function using setjmp()/longjmp() and not just immediate caller of setjmp()/longjmp()?
[Bug tree-optimization/114872] [13/14/15 Regression] Miscompilation with -O2 after commit r13-8037
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114872 --- Comment #24 from Richard Biener --- (In reply to Sergei Trofimovich from comment #23) [...] > Why did `gcc` generate unconditional NULL dereference here? I suspect it > somehow inferred that `__pyx_t_6 = NULL;` in that branch, but not before > comparison. That's what happens if we isolate an unreachable path because of a NULL dereference (like if exposed by jump-threading). We make the NULL dereference volatile so it stays but DCE/DSE can cleanup code on the path leading to it. If you run into such path the this might suggest that jump-threading triggered a problem with the setjmp/longjmp, so it's then likely some condition that's evaluated in a wrong way after the longjmp, either because a dependent value wasn't properly preserved or by GCC breaking that. Seeing stack memory arguments used on a call in a previous comment I wondered if POSIX suggests that even non-register variables need to be made volatile and thus whether SRA or FRE might impose problems with code using setjmp/longjmp.
[Bug tree-optimization/114872] [13/14/15 Regression] Miscompilation with -O2 after commit r13-8037
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114872 --- Comment #23 from Sergei Trofimovich --- At SIGSEGV site the code is an unconditional NULL dereference due to dereference of `xor %esi,%esi` result from `gdb`. 797 if (op != _Py_NULL) { 0x7f940c871563 <+2563>: cmpq $0x0,-0xc8(%rbp) 0x7f940c87156b <+2571>: je 0x7f940c871583 <__pyx_pf_4sage_4libs_3gap_7element_19GapElement_Function_2__call__+2595> 242 return _Py_CAST(PY_INT32_T, op->ob_refcnt) < 0; 0x7f940c87156d <+2573>: xor%esi,%esi => 0x7f940c87156f <+2575>: mov(%rsi),%rax In `element-verbose.S` it is: # /usr/include/python3.12/object.h:797: if (op != _Py_NULL) { .loc 5 797 8 is_stmt 0 view .LVU65876 cmpq $0, -200(%rbp)<>#, %sfp je<>.L12727>#, .loc 5 798 9 is_stmt 1 view .LVU65877 .LVL15705: .LBB49946: .LBI49946: .loc 5 696 37 view .LVU65878 .LBB49947: .loc 5 700 5 view .LVU65879 .LBB49948: .LBI49948: .loc 5 239 36 view .LVU65880 .LBB49949: .loc 5 242 5 view .LVU65881 # /usr/include/python3.12/object.h:242: return _Py_CAST(PY_INT32_T, op->ob_refcnt) < 0; .loc 5 242 12 is_stmt 0 view .LVU65882 xorl %esi, %esi # r movq (%rsi), %rax # __pyx_t_6_208(ab)->D.11083.ob_refcnt, _991 Looking at other sites in `element-verbose.S` for comparison do try to use `-0xc8(%rbp)` contents: # /usr/include/python3.12/object.h:797: if (op != _Py_NULL) { .loc 5 797 8 is_stmt 0 view .LVU66162 cmpq $0, -200(%rbp) #, %sfp je .L12782>#, .loc 5 798 9 is_stmt 1 view .LVU66163 .LVL15760: .LBB50093: .LBI50093: .loc 5 696 37 view .LVU66164 .LBB50094: .loc 5 700 5 view .LVU66165 .LBB50095: .LBI50095: .loc 5 239 36 view .LVU66166 .LBB50096: .loc 5 242 5 view .LVU66167 # /usr/include/python3.12/object.h:242: return _Py_CAST(PY_INT32_T, op->ob_refcnt) < 0; .loc 5 242 12 is_stmt 0 view .LVU66168 movq -200(%rbp), %rdx # %sfp, r movq (%rdx), %rax # __pyx_t_6_10(ab)->D.11083.ob_refcnt, _1070 Thus my guess is that something clobbered `-200(%rbp)` value across setjmp()/longjmp(). Trying to trace: $ gdb -p `pgrep sage-ipython` (gdb) break __pyx_pf_4sage_4libs_3gap_7element_19GapElement_Function_2__call__ (gdb) continue # trigger break with with ` libgap.AbelianGroup(0,0,0)` (gdb) disassemble Dump of assembler code for function __pyx_pf_4sage_4libs_3gap_7element_19GapElement_Function_2__call__: => 0x7f4ed9981b60 <+0>: push %rbp 0x7f4ed9981b61 <+1>: mov%rsp,%rbp # Populating `%rbp`: (gdb) nexti (gdb) nexti (gdb) disassemble Dump of assembler code for function __pyx_pf_4sage_4libs_3gap_7element_19GapElement_Function_2__call__: 0x7f4ed9981b60 <+0>: push %rbp 0x7f4ed9981b61 <+1>: mov%rsp,%rbp => 0x7f4ed9981b64 <+4>: push %r15 (gdb) print $rbp-200 $2 = (void *) 0x7ffd2824c5e8 (gdb) watch *(int*)(void *) 0x7ffd2824c5e8 Hardware watchpoint 2: *(int*)(void *) 0x7ffd2824c5e8 (gdb) continue Continuing. Thread 1 "sage-ipython" hit Hardware watchpoint 2: *(int*)(void *) 0x7ffd2824c5e8 Old value = 673498624 New value = 0 0x7f98e609d2a8 in __pyx_pf_4sage_4libs_3gap_7element_19GapElement_Function_2__call__ ( __pyx_v_self=__pyx_v_self@entry=0x7f98dfe70dc0, __pyx_v_args=__pyx_v_args@entry=(, , )) at /usr/src/debug/sci-mathematics/sagemath-standard-10.3/sagemath-standard-10.3-python3_12/build/cythonized/sage/libs/gap/element.c:26192 26192 __pyx_t_6 = NULL; NULL store. (gdb) continue Continuing. Thread 1 "sage-ipython" hit Hardware watchpoint 2: *(int*)(void *) 0x7ffd2824c5e8 Old value = 0 New value = -538669696 __Pyx_GetItemInt_List_Fast (wraparound=0, boundscheck=1, i=2, o=[, , ]) at /usr/src/debug/sci-mathematics/sagemath-standard-10.3/sagemath-standard-10.3-python3_12/build/cythonized/sage/libs/gap/element.c:38070 38070 Py_INCREF(r); Create an object? (gdb) continue Continuing. Thread 1 "sage-ipython" received signal SIGABRT, Aborted. 0x7f99428617a7 in __GI_kill () at ../sysdeps/unix/syscall-template.S:120 120 T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS) Abort. (gdb) continue Continuing. Thread 1 "sage-ipython" received signal SIGSEGV, Segmentation fault. 0x7f98e609c56f in _Py_IsImmortal (op=0x0) at /usr/include/python3.12/object.h:242 242 return _Py_CAST(PY_INT32_T, op->ob_refcnt) < 0; SIGSEGV. Note that all two memory references happen before longjmp() (the ABORT). Why did `gcc` generate unconditional NULL dereference here? I suspect it somehow inferred that `__pyx_t_6 = NULL;` in that branch, but not before comparison.
[Bug tree-optimization/114872] [13/14/15 Regression] Miscompilation with -O2 after commit r13-8037
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114872 --- Comment #22 from Sergei Trofimovich --- Trying again to catch more precise place for SIGABRT. Beginning at the start of the possibly aborting function: (gdb) break __pyx_pf_4sage_4libs_3gap_7element_19GapElement_Function_2__call__ (gdb) continue Running step by step to get the rough location: (gdb) continue # many times (gdb) 26459 __pyx_t_6 = __Pyx_GetItemInt_List(__pyx_v_a, 2, long, 1, __Pyx_PyInt_From_long, 1, 0, 1); if (unlikely(!__pyx_t_6)) __PYX_ERR(0, 2528, __pyx_L14_error) (gdb) 26469 __pyx_v_result = GAP_CallFunc3Args(__pyx_v_self->__pyx_base.value, ((struct __pyx_obj_4sage_4libs_3gap_7element_GapElement *)__pyx_t_5)->value, ((struct __pyx_obj_4sage_4libs_3gap_7element_GapElement *)__pyx_t_4)->value, ((struct __pyx_obj_4sage_4libs_3gap_7element_GapElement *)__pyx_t_6)->value); (gdb) Thread 1 "sage-ipython" received signal SIGABRT, Aborted. 0x7f73e4a617a7 in __GI_kill () at ../sysdeps/unix/syscall-template.S:120 120 T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS) Don't know why backtrace does not point at element.c:26459. It is this snippet: /* "sage/libs/gap/element.pyx":2525 *(a[1]).value) * elif n == 3: * result = GAP_CallFunc3Args(self.value, # << *(a[0]).value, *(a[1]).value, */ __pyx_v_result = GAP_CallFunc3Args(__pyx_v_self->__pyx_base.value, ((struct __pyx_obj_4sage_4libs_3gap_7element_GapElement *)__pyx_t_5)->value, ((struct __pyx_obj_4sage_4libs_3gap_7element_GapElement *)__pyx_t_4)->value, ((struct __pyx_obj_4sage_4libs_3gap_7element_GapElement *)__pyx_t_6)->value);
[Bug tree-optimization/114872] [13/14/15 Regression] Miscompilation with -O2 after commit r13-8037
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114872 --- Comment #21 from Sergei Trofimovich --- Good point! I wonder if I'm looking at the backtrace too late (or at the wrong one). I'll retry again this evening and will extract more context.
[Bug tree-optimization/114872] [13/14/15 Regression] Miscompilation with -O2 after commit r13-8037
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114872 --- Comment #20 from Jakub Jelinek --- (In reply to Sergei Trofimovich from comment #18) > > 2) ideally show a gdb session with the important events, which setjmp was > > it (I see _setjmp and __sigsetjmp calls in the function), which exact > > function called from the function ended up aborting/doing longjmp in the > > signal handler and where is the crash Strange. If it is the _setjmp call on line 26315 followed by kill SIGABRT from sig_error(); on the same line, then I don't see any local vars modified in between, except that int t on line 26315 inside of sig_GAP_Enter macro. There is then __sigsetjmp on line 26324, but no further kill calls and I think this isn't in a loop.
[Bug tree-optimization/114872] [13/14/15 Regression] Miscompilation with -O2 after commit r13-8037
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114872 --- Comment #19 from Dmitrii Pasechnik --- Declaring the last argument in the call to GAP_CallFunc3Args() volatile appears to fix the issue. Namely, apply diff --git a/src/sage/libs/gap/element.pyx b/src/sage/libs/gap/element.pyx index f1482997b8..7ca4a666ab 100644 --- a/src/sage/libs/gap/element.pyx +++ b/src/sage/libs/gap/element.pyx @@ -2504,6 +2504,7 @@ cdef class GapElement_Function(GapElement): cdef Obj result = NULL cdef Obj arg_list cdef int n = len(args) +cdef volatile Obj v2 if n > 0 and n <= 3: libgap = self.parent() @@ -2522,10 +2523,11 @@ cdef class GapElement_Function(GapElement): (a[0]).value, (a[1]).value) elif n == 3: +v2 = (a[2]).value result = GAP_CallFunc3Args(self.value, (a[0]).value, (a[1]).value, - (a[2]).value) + v2) else: arg_list = make_gap_list(args) result = GAP_CallFuncList(self.value, arg_list)
[Bug tree-optimization/114872] [13/14/15 Regression] Miscompilation with -O2 after commit r13-8037
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114872 --- Comment #18 from Sergei Trofimovich --- > 2) ideally show a gdb session with the important events, which setjmp was it > (I see _setjmp and __sigsetjmp calls in the function), which exact function > called from the function ended up aborting/doing longjmp in the signal > handler and where is the crash # gdb --quiet -p 1180766 Attaching to a running `sage` interactive process. In sage repl typing: libgap.AbelianGroup(0,0,0) Breakpoint happens. SIGABRT (immediate longjmp trigger) backtrace: Thread 1 "sage-ipython" received signal SIGABRT, Aborted. 0x7f53f8e617a7 in __GI_kill () at ../sysdeps/unix/syscall-template.S:120 120 T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS) (gdb) bt #0 0x7f53f8e617a7 in __GI_kill () at ../sysdeps/unix/syscall-template.S:120 #1 0x7f539c581edf in sig_error () at /usr/lib/python3.12/site-packages/cysignals/macros.h:298 #2 __pyx_pf_4sage_4libs_3gap_7element_19GapElement_Function_2__call__ (__pyx_v_self=__pyx_v_self@entry=0x7f539635bc40, __pyx_v_args=__pyx_v_args@entry=(, , )) at /usr/src/debug/sci-mathematics/sagemath-standard-10.3/sagemath-standard-10.3-python3_12/build/cythonized/sage/libs/gap/element.c:26315 #3 0x7f539c5834e7 in __pyx_pw_4sage_4libs_3gap_7element_19GapElement_Function_3__call__ (__pyx_v_self=, __pyx_args=(, , ), __pyx_kwds=) at /usr/src/debug/sci-mathematics/sagemath-standard-10.3/sagemath-standard-10.3-python3_12/build/cythonized/sage/libs/gap/element.c:26105 #4 0x7f53f916496b in _PyObject_MakeTpCall (tstate=0x7f53f9670d08 <_PyRuntime+459656>, callable=callable@entry=, args=args@entry=0x7f53f96b5480, nargs=3, keywords=0x0) at Objects/call.c:240 ... SIGSEGV backtrace (for completeness): Thread 1 "sage-ipython" received signal SIGSEGV, Segmentation fault. 0x7f539c58256f in _Py_IsImmortal (op=0x0) at /usr/include/python3.12/object.h:242 242 return _Py_CAST(PY_INT32_T, op->ob_refcnt) < 0; (gdb) bt #0 0x7f539c58256f in _Py_IsImmortal (op=0x0) at /usr/include/python3.12/object.h:242 #1 Py_DECREF (op=0x0) at /usr/include/python3.12/object.h:700 #2 Py_XDECREF (op=0x0) at /usr/include/python3.12/object.h:798 #3 __pyx_pf_4sage_4libs_3gap_7element_19GapElement_Function_2__call__ (__pyx_v_self=__pyx_v_self@entry=0x7f539635bc40, __pyx_v_args=__pyx_v_args@entry=(, , )) at /usr/src/debug/sci-mathematics/sagemath-standard-10.3/sagemath-standard-10.3-python3_12/build/cythonized/sage/libs/gap/element.c:26535 #4 0x7f539c5834e7 in __pyx_pw_4sage_4libs_3gap_7element_19GapElement_Function_3__call__ (__pyx_v_self=, __pyx_args=(, , ), __pyx_kwds=) at /usr/src/debug/sci-mathematics/sagemath-standard-10.3/sagemath-standard-10.3-python3_12/build/cythonized/sage/libs/gap/element.c:26105 #5 0x7f53f916496b in _PyObject_MakeTpCall (tstate=0x7f53f9670d08 <_PyRuntime+459656>, callable=callable@entry=, args=args@entry=0x7f53f96b5480, nargs=3, keywords=0x0) at Objects/call.c:240 Catching `*jmp` flavours: (gdb) break __pyx_pf_4sage_4libs_3gap_7element_19GapElement_Function_2__call__ (gdb) continue ... (gdb) break __GI___sigsetjmp (gdb) break longjmp (gdb) break siglongjmp (gdb) continue # a lot of them (gdb) bt #0 __GI___sigsetjmp () at ../sysdeps/x86_64/setjmp.S:33 #1 0x7fc83e6e6ea4 in __pyx_pf_4sage_4libs_3gap_7element_19GapElement_Function_2__call__ (__pyx_v_self=__pyx_v_self@entry=0x7fc83844f2c0, __pyx_v_args=__pyx_v_args@entry=(, , )) at /usr/src/debug/sci-mathematics/sagemath-standard-10.3/sagemath-standard-10.3-python3_12/build/cythonized/sage/libs/gap/element.c:26315 (gdb) fr 1 #1 0x7fc83e6e6ea4 in __pyx_pf_4sage_4libs_3gap_7element_19GapElement_Function_2__call__ (__pyx_v_self=__pyx_v_self@entry=0x7fc83844f2c0, __pyx_v_args=__pyx_v_args@entry=(, , )) at /usr/src/debug/sci-mathematics/sagemath-standard-10.3/sagemath-standard-10.3-python3_12/build/cythonized/sage/libs/gap/element.c:26315 26315 sig_GAP_Enter(); (gdb) disassemble Dump of assembler code for function __pyx_pf_4sage_4libs_3gap_7element_19GapElement_Function_2__call__: 0x7fc83e6e6e9f <+831>: call 0x7fc83e6b54e0 <_setjmp@plt> => 0x7fc83e6e6ea4 <+836>: mov$0x1,%ebx That is a _setjmp at element.c:26315. ABORT happens after it.
[Bug tree-optimization/114872] [13/14/15 Regression] Miscompilation with -O2 after commit r13-8037
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114872 --- Comment #17 from Sergei Trofimovich --- > 1) attach your *.s file and state which exact compiler you used (revision) Generate code first: https://slyfox.uni.cx/b/gcc/PR114872/d.tar.gz (4MB, does not fit on bugzilla's 1MB limit) is the archive of source files from sage-10.3: - element.c: original unpreprocessed file - element.c.c: preprocessed file - element.S: assembly file (-S) - element-verbose.S: assembly file (-S -fverbose-asm, has compiler options) Compiler: # gcc -v Using built-in specs. COLLECT_GCC=gcc COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-pc-linux-gnu/13/lto-wrapper Target: x86_64-pc-linux-gnu Configured with: /dev/shm/portage/sys-devel/gcc-13.2.1_p20240503/work/gcc-13-20240503/configure --host=x86_64-pc-linux-gnu --build=x86_64-pc-linux-gnu --prefix=/usr --bindir=/usr/x86_64-pc-linux-gnu/gcc-bin/13 --includedir=/usr/lib/gcc/x86_64-pc-linux-gnu/13/include --datadir=/usr/share/gcc-data/x86_64-pc-linux-gnu/13 --mandir=/usr/share/gcc-data/x86_64-pc-linux-gnu/13/man --infodir=/usr/share/gcc-data/x86_64-pc-linux-gnu/13/info --with-gxx-include-dir=/usr/lib/gcc/x86_64-pc-linux-gnu/13/include/g++-v13 --disable-silent-rules --disable-dependency-tracking --with-python-dir=/share/gcc-data/x86_64-pc-linux-gnu/13/python --enable-languages=c,c++,fortran --enable-obsolete --enable-secureplt --disable-werror --with-system-zlib --enable-nls --without-included-gettext --disable-libunwind-exceptions --enable-checking=release --with-bugurl=https://bugs.gentoo.org/ --with-pkgversion='Gentoo 13.2.1_p20240503 p15' --with-gcc-major-version-only --enable-libstdcxx-time --enable-lto --disable-libstdcxx-pch --enable-shared --enable-threads=posix --enable-__cxa_atexit --enable-clocale=gnu --enable-multilib --with-multilib-list=m32,m64 --disable-fixed-point --enable-targets=all --enable-libgomp --disable-libssp --disable-libada --disable-cet --disable-systemtap --disable-valgrind-annotations --disable-vtable-verify --disable-libvtv --without-zstd --without-isl --enable-default-pie --enable-default-ssp --disable-fixincludes Thread model: posix Supported LTO compression algorithms: zlib gcc version 13.2.1 20240503 (Gentoo 13.2.1_p20240503 p15) Note: it's a weekly snapshot of 20240503 with --enable-default-ssp. I think it's a r13-8684-g704b15e277a879. The compiler also has a few distro-specific patches on top: https://gitweb.gentoo.org/proj/gcc-patches.git/tree/13.2.0/gentoo. Most of them should not affect code generation too much, but some like 22_all_default_ssp-buffer-size.patch and 84_all_x86_PR110792-Early-clobber-issues-with-rot32di2.patch might.
[Bug tree-optimization/114872] [13/14/15 Regression] Miscompilation with -O2 after commit r13-8037
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114872 --- Comment #16 from Jakub Jelinek --- (In reply to Sergei Trofimovich from comment #14) > I reproduced the `SIGSEGV` on Gentoo ~amd64 and ::sage-on-gentoo overlay > against sci-mathematics/sagemath-standard package. > > One of the unusual properties of > __pyx_pf_4sage_4libs_3gap_7element_19GapElement_Function_2__call__() is that > it raises 2 signals while it gets executed: > > - SIGABRT handler uses longjmp() to return to the ~beginning of a function > - and then SIGSEGV happens at cleanup when an attempt to dereference the > pointer happens. > > I see no `volatile` annotations anywhere in the > __pyx_pf_4sage_4libs_3gap_7element_19GapElement_Function_2__call__(). > > My wild guess would be that: > 1. `PyObject *__pyx_t_4 = ((void *)0);` gets saved in setjmp() with one > value (probably NULL) > 2. updated at some point later in the same function to non-NULL that `gcc` > can infer and throw away all later `NULL` checks > 3. then SIGABRT returns with longjmp() by accidentally resetting > > I would expect `__pyx_t_4` to require volatile annotation for such an > `element.i` definition. Or `longjmp()` should be called from a `((noipa))` > function to force register spill/reload on stack. > > To cite `man setjmp`: > > """ > CAVEATS >The compiler may optimize variables into registers, and > longjmp() may restore the values of other registers in addition to the stack > pointer and program counter. Consequently, the values of automatic >variables are unspecified after a call to longjmp() if they meet all > the following criteria: >• they are local to the function that made the corresponding > setjmp() call; >• their values are changed between the calls to setjmp() and > longjmp(); and >• they are not declared as volatile. >Analogous remarks apply for siglongjmp(). > """ > > Sounds plausible? So, if you can reproduce it, can you: 1) attach your *.s file and state which exact compiler you used (revision) 2) ideally show a gdb session with the important events, which setjmp was it (I see _setjmp and __sigsetjmp calls in the function), which exact function called from the function ended up aborting/doing longjmp in the signal handler and where is the crash 3) is it __pyx_t_6, __pyx_t_4 or some other pointer that triggers it (from the line numbers in #c0 my guess was __pyx_t_6, but you talk about __pyx_t_4) Yes, there are no volatile keywords on any of the vars, but without knowing which setjmp call it is and from where longjmp jumps to it, it is hard to know if the variables have been modified in between (then volatile would be required) or if they are only modified before the setjmp call or after the call that calls longjmp (then volatile might not be required).
[Bug tree-optimization/114872] [13/14/15 Regression] Miscompilation with -O2 after commit r13-8037
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114872 --- Comment #15 from Jakub Jelinek --- (In reply to Dmitrii Pasechnik from comment #12) > A colleague disassembled, using ghidra (https://ghidra-sre.org/), the > results of the compilations with, respectively, -O2 and with -O0 flags. > Comparing the results, in the broken (built with -O2) case one sees a > miscompilation of a call to GAP_CallFunc3Args - it is called with one > argument less than it should, three instead of four! > > broken (-O2): > > > plVar9 = (long *)GAP_CallFunc3Args(*(undefined8 *)(param_1 + > > 0x20),local_a0[4], > > local_a8[4]); > > vs. good (-O0): > > < LAB_0013cbd9: > < plVar10 = (long *)GAP_CallFunc3Args(*(undefined8 *)(param_1 + > 0x20),local_a8[4], > < local_a0[4],plVar16[4]); > > And this is despite the prototype for calling GAP_CallFunc3Args() is > found in "gap/libgap-api.h", which is included in example.c as #include > "gap/libgap-api.h", meant to be respected during the compilation. > > I hope this helps in chasing down the obvious compiler bug. Perhaps it can > be also seen without disassembling, simply on the intermediate data > generated by the compiler. Both calls pass 4 arguments, both in optimized -O2 -fPIC GCC 13.2.1 dump and in assembly: movq-88(%rbp), %rdx movq%r12, %rcx movq%r13, %rsi movq%rax, %rdi callGAP_CallFunc3Args@PLT ... movq-88(%rbp), %rdx movq%r12, %rcx movq%r13, %rsi movq%rax, %rdi callGAP_CallFunc3Args@PLT and GAP_CallFunc3Args (_53, _47, __pyx_v_func_54(D), _49); ... _441 = GAP_CallFunc3Args (_98, _97, _96, _95);
[Bug tree-optimization/114872] [13/14/15 Regression] Miscompilation with -O2 after commit r13-8037
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114872 Sergei Trofimovich changed: What|Removed |Added CC||slyfox at gcc dot gnu.org --- Comment #14 from Sergei Trofimovich --- I reproduced the `SIGSEGV` on Gentoo ~amd64 and ::sage-on-gentoo overlay against sci-mathematics/sagemath-standard package. One of the unusual properties of __pyx_pf_4sage_4libs_3gap_7element_19GapElement_Function_2__call__() is that it raises 2 signals while it gets executed: - SIGABRT handler uses longjmp() to return to the ~beginning of a function - and then SIGSEGV happens at cleanup when an attempt to dereference the pointer happens. I see no `volatile` annotations anywhere in the __pyx_pf_4sage_4libs_3gap_7element_19GapElement_Function_2__call__(). My wild guess would be that: 1. `PyObject *__pyx_t_4 = ((void *)0);` gets saved in setjmp() with one value (probably NULL) 2. updated at some point later in the same function to non-NULL that `gcc` can infer and throw away all later `NULL` checks 3. then SIGABRT returns with longjmp() by accidentally resetting I would expect `__pyx_t_4` to require volatile annotation for such an `element.i` definition. Or `longjmp()` should be called from a `((noipa))` function to force register spill/reload on stack. To cite `man setjmp`: """ CAVEATS The compiler may optimize variables into registers, and longjmp() may restore the values of other registers in addition to the stack pointer and program counter. Consequently, the values of automatic variables are unspecified after a call to longjmp() if they meet all the following criteria: • they are local to the function that made the corresponding setjmp() call; • their values are changed between the calls to setjmp() and longjmp(); and • they are not declared as volatile. Analogous remarks apply for siglongjmp(). """ Sounds plausible?
[Bug tree-optimization/114872] [13/14/15 Regression] Miscompilation with -O2 after commit r13-8037
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114872 --- Comment #13 from Dmitrii Pasechnik --- Created attachment 58099 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58099=edit data for comment 12 - decompiled things data for comment 12 - decompiled .so's, .so's themselves, original C file. BROKEN* are for -O2 option, and FIXED* are for -O0 option.
[Bug tree-optimization/114872] [13/14/15 Regression] Miscompilation with -O2 after commit r13-8037
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114872 Dmitrii Pasechnik changed: What|Removed |Added CC||dima.pasechnik at cs dot ox.ac.uk --- Comment #12 from Dmitrii Pasechnik --- A colleague disassembled, using ghidra (https://ghidra-sre.org/), the results of the compilations with, respectively, -O2 and with -O0 flags. Comparing the results, in the broken (built with -O2) case one sees a miscompilation of a call to GAP_CallFunc3Args - it is called with one argument less than it should, three instead of four! broken (-O2): > plVar9 = (long *)GAP_CallFunc3Args(*(undefined8 *)(param_1 + > 0x20),local_a0[4], > local_a8[4]); vs. good (-O0): < LAB_0013cbd9: < plVar10 = (long *)GAP_CallFunc3Args(*(undefined8 *)(param_1 + 0x20),local_a8[4], < local_a0[4],plVar16[4]); And this is despite the prototype for calling GAP_CallFunc3Args() is found in "gap/libgap-api.h", which is included in example.c as #include "gap/libgap-api.h", meant to be respected during the compilation. I hope this helps in chasing down the obvious compiler bug. Perhaps it can be also seen without disassembling, simply on the intermediate data generated by the compiler.
[Bug tree-optimization/114872] [13/14/15 Regression] Miscompilation with -O2 after commit r13-8037
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114872 --- Comment #11 from Sam James --- also, do asan and ubsan give anything?
[Bug tree-optimization/114872] [13/14/15 Regression] Miscompilation with -O2 after commit r13-8037
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114872 --- Comment #10 from Sam James --- also maybe obvious, but if you can find something from the cython testsuite, or at least some other heavy use of cython, which fails, that may be a good direction.
[Bug tree-optimization/114872] [13/14/15 Regression] Miscompilation with -O2 after commit r13-8037
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114872 --- Comment #9 from Sam James --- unfortunately*
[Bug tree-optimization/114872] [13/14/15 Regression] Miscompilation with -O2 after commit r13-8037
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114872 --- Comment #8 from Sam James --- element.i is unfortunatley huge. It's hard to analyse things without a standalone testcase, but it's even harder without _something_ one can run. I'd suggest: 1) giving instructions to reproduce the crash assuming someone knows nothing about sage and cython; 2) taking sage and cannibalising it (first to directly call `libgap.AbelianGroup(0,0,0)`, then you can cut things down with lots of removals + gdb so that the important caller of Py_XDECREF gets passed with the same args; it's harder if there's a lot of state involved though, of course) 3) build element.i with -fdump-tree-all -fdump-unnumbered -fdump-noaddr (can also try e.g. -da, see https://gcc.gnu.org/onlinedocs/gcc/Developer-Options.html) at the bad commit and before, and diff the produced dumps, and show us the *first* dump which differs between the two
[Bug tree-optimization/114872] [13/14/15 Regression] Miscompilation with -O2 after commit r13-8037
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114872 --- Comment #7 from Antonio Rojas --- (In reply to Sam James from comment #5) > Some ideas: > * Could you maybe give a reproducer for the runtime crash? In sage, calling any libgap function in exactly 3 parameters triggers this. For instance: sage: libgap.AbelianGroup(0,0,0) Note that you need Python 3.12 to get the crash (which happens when trying to derefence the null pointer in the newly introduced _Py_IsImmortal function) > * Any chance you'd be willing to try bisect element.i with pragmas to > disable/enable optimisation for chunks of it, to find the miscompiled > function? I have done that. As expected, the problem is in Py_XDECREF. This makes the problem disappear: #pragma GCC push_options #pragma GCC optimize ("O0") static inline void Py_XDECREF(PyObject *op) { if (op != # 797 "/usr/include/python3.12/object.h" 3 4 ((void *)0) # 797 "/usr/include/python3.12/object.h" ) { Py_DECREF(((PyObject*)((op; } } #pragma GCC pop_options
[Bug tree-optimization/114872] [13/14/15 Regression] Miscompilation with -O2 after commit r13-8037
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114872 --- Comment #6 from Jakub Jelinek --- I have looked at the IL and I don't see how it could crash that way based on the -fdump-tree-optimized-lineno dump, neither on branch nor on the trunk. On the branch, I see [local count: 462566023]: [element.c:26539:57 discrim 1] __pyx_t_5_522(ab) = 0B; [/usr/include/python3.12/object.h:797:8] if (__pyx_t_6_211(ab) != 0B) goto ; [70.00%] else goto ; [30.00%] [local count: 462566023]: [element.c:26540:57 discrim 1] __pyx_t_6_524(ab) = 0B; [/usr/include/python3.12/object.h:797:8] if (__pyx_t_8_1031(ab) != 0B) goto ; [70.00%] else goto ; [30.00%] [local count: 323796219]: [/usr/include/python3.12/object.h:242:25] _1105 = [/usr/include/python3.12/object.h:242:25] [/usr/include/python3.12/object.h:242:25] __pyx_t_6_211(ab)->D.10046.ob_refcnt; [/usr/include/python3.12/object.h:242:13] _1106 = (int) _1105; [/usr/include/python3.12/object.h:700:8 discrim 1] if (_1106 < 0) goto ; [26.36%] else goto ; [73.64%] [local count: 238443538]: [/usr/include/python3.12/object.h:704:9] _1107 = _1105 + -1; [/usr/include/python3.12/object.h:704:8] [/usr/include/python3.12/object.h:704:13] [/usr/include/python3.12/object.h:704:13] __pyx_t_6_211(ab)->D.10046.ob_refcnt = _1107; [/usr/include/python3.12/object.h:704:8] if (_1107 == 0) goto ; [33.00%] else goto ; [67.00%] [local count: 78686364]: [/usr/include/python3.12/object.h:705:9] _Py_Dealloc (__pyx_t_6_211(ab)); goto ; [100.00%] so the 26540 line case checks if the pointer later passed to the _Py_Dealloc is non-NULL and doesn't let it be called in that case.
[Bug tree-optimization/114872] [13/14/15 Regression] Miscompilation with -O2 after commit r13-8037
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114872 --- Comment #5 from Sam James --- Some ideas: * Could you maybe give a reproducer for the runtime crash? * Any chance you'd be willing to try bisect element.i with pragmas to disable/enable optimisation for chunks of it, to find the miscompiled function?
[Bug tree-optimization/114872] [13/14/15 Regression] Miscompilation with -O2 after commit r13-8037
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114872 Andrew Pinski changed: What|Removed |Added Target Milestone|--- |13.3 Summary|Miscompilation with -O2 |[13/14/15 Regression] |after commit r13-8037 |Miscompilation with -O2 ||after commit r13-8037 Known to fail||13.2.1 Known to work||13.2.0