Hi,

On Thu, 2021-12-16 at 01:10 +0000, build...@builder.wildebeest.org
wrote:
> The Buildbot has detected a new failure on builder
> elfutils-centos-x86_64 while building elfutils.
> Full details are available at:
>     https://builder.wildebeest.org/buildbot/#builders/1/builds/884
> 
> Buildbot URL: https://builder.wildebeest.org/buildbot/
> 
> Worker for this Build: centos-x86_64
> 
> Build Reason: <unknown>
> Blamelist: Alexander Kanavin <a...@linutronix.de>
> 
> BUILD FAILED: failed test (failure)
> 
> Sincerely,
>  -The Buildbot
> 
> The Buildbot has detected a new failure on builder
> elfutils-fedora-x86_64 while building elfutils.
> Full details are available at:
>     https://builder.wildebeest.org/buildbot/#builders/3/builds/876
> 
> Buildbot URL: https://builder.wildebeest.org/buildbot/
> 
> Worker for this Build: fedora-x86_64
> 
> Build Reason: <unknown>
> Blamelist: Alexander Kanavin <a...@linutronix.de>
> 
> BUILD FAILED: failed test (failure)

So this is really unfortunate and has nothing to do with the patch from
Alexander.

These are two different but related failures.

On centos-x86_64 this is:

FAIL: run-backtrace-native-core-biarch.sh
=========================================

/usr/bin/coredumpctl
0xf77ac000      0xf77ad000      linux-gate.so.1
0xf77ad000      0xf77d08fc      ld-linux.so.2
0xf75b4000      0xf777ea1c      libc.so.6
0xf777f000      0xf7799248      libpthread.so.0
0x5658e000      0x56591050      backtrace-child-biarch
TID 24658:
# 0 0xf77ac430          __kernel_vsyscall
# 1 0xf778dd16 - 1      raise
# 2 0x5658eafc - 1      sigusr2
# 3 0x5658ebeb - 1      stdarg
# 4 0x5658ec2f - 1      backtracegen
# 5 0x5658ec38 - 1      start
# 6 0xf7785bbc - 1      start_thread
# 7 0xf76b227e - 1      __clone
TID 24656:
# 0 0xf76b2268          __clone
/srv/buildbot/worker/elfutils-centos-x86_64/build/tests/backtrace: dwfl_thread_getframes: No DWARF information found
backtrace: backtrace.c:81: callback_verify: Assertion `seen_main' failed.
./test-subr.sh: line 84: 24682 Aborted                 (core dumped) LD_LIBRARY_PATH="${built_library_path}${LD_LIBRARY_PATH:+:}$LD_LIBRARY_PATH" $VALGRIND_CMD "$@"
backtrace-child-biarch-core.24656: no main

Note that this is an i386 process being backtraced on x86_64.

On fedora-x86_64 this is:


FAIL: run-backtrace-native-core.sh
==================================

/usr/bin/coredumpctl
0x7ffd3d934000  0x7ffd3d935000  linux-vdso.so.1
0x7f4ccbf99000  0x7f4ccbfcd200  ld-linux-x86-64.so.2
0x7f4ccbd7b000  0x7f4ccbf84ad0  libc.so.6
0x56038dbfc000  0x56038dc000a8  backtrace-child
TID 3043057:
# 0 0x7f4ccbe0a89c      __pthread_kill_implementation
# 1 0x7f4ccbdbd6b6 - 1  raise
# 2 0x56038dbfd3fd - 1  sigusr2
# 3 0x56038dbfd4ca - 1  stdarg
# 4 0x56038dbfd4e0 - 1  backtracegen
# 5 0x56038dbfd4e9 - 1  start
# 6 0x7f4ccbe08ad7 - 1  start_thread
# 7 0x7f4ccbe8d770 - 1  __clone3
TID 3043052:
# 0 0x7f4ccbe8d75d      __clone3
/srv/buildbot/worker/elfutils-fedora-x86_64/build/tests/backtrace: dwfl_thread_getframes: address out of range
backtrace: backtrace.c:81: callback_verify: Assertion `seen_main' failed.
./test-subr.sh: line 84: 3043062 Aborted                 (core dumped) LD_LIBRARY_PATH="${built_library_path}${LD_LIBRARY_PATH:+:}$LD_LIBRARY_PATH" $VALGRIND_CMD "$@"
backtrace-child-core.3043052: no main
rmdir: failed to remove 'test-3043029': Directory not empty
FAIL run-backtrace-native-core.sh (exit status: 1)

This is an x86_64 process core being backtraced on x86_64.

The problem in both cases is that the parent cannot unwind from the
exact pc it is stuck at. With eu-stack -v --core we can see (for the
parent TID):

TID 3043052:
#0  0x00007f4ccbe8d75d     __clone3 - libc.so.6
    ../sysdeps/unix/sysv/linux/x86_64/clone3.S:62
eu-stack: dwfl_thread_getframes tid 3043052 at 0x7f4ccbe8d75d in libc.so.6: address out of range

That is this source code:

ENTRY (__clone3)
        /* Sanity check arguments.  */
        movl    $-EINVAL, %eax
        test    %RDI_LP, %RDI_LP        /* No NULL cl_args pointer.  */
        jz      SYSCALL_ERROR_LABEL
        test    %RDX_LP, %RDX_LP        /* No NULL function pointer.  */
        jz      SYSCALL_ERROR_LABEL

        /* Save the cl_args pointer in R8 which is preserved by the
           syscall.  */
        mov     %RCX_LP, %R8_LP

        /* Do the system call.  */
        movl    $SYS_ify(clone3), %eax

        /* End FDE now, because in the child the unwind info will be
           wrong.  */
        cfi_endproc
        syscall

=>      test    %RAX_LP, %RAX_LP
        jl      SYSCALL_ERROR_LABEL
        jz      L(thread_start)

        ret

L(thread_start):
        cfi_startproc
        /* Clearing frame pointer is insufficient, use CFI.  */
        cfi_undefined (rip)
        /* Clear the frame pointer.  The ABI suggests this be done, to mark
           the outermost frame obviously.  */
        xorl    %ebp, %ebp

        /* Align stack to 16 bytes per the x86-64 psABI.  */
        and     $-16, %RSP_LP
[...]

So the PC is right after the syscall, where, as the code comment says,
there is no CFI. Apparently the child ran first and quickly got to the
terminating kill while the parent was still stuck in the syscall (or
just out of it, but not yet returned from the clone3 call).

I think some synchronization is missing between the parent and child.
But the test code is fairly complex.
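
As a rough illustration of the kind of synchronization I mean (a
minimal sketch, not the actual test code; the names parent_returned,
child_thread and run_synchronized_child are made up for this example),
the new thread could wait on a semaphore that the parent posts only
after pthread_create has returned. At that point the parent can no
longer be sitting inside the clone3 wrapper when the child raises its
signal:

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>

/* All names here are hypothetical, chosen only for this sketch.  */
static sem_t parent_returned;      /* Posted once pthread_create returned.  */
static int parent_out_of_clone;    /* Set by the parent before posting.  */
static int child_saw_parent_out;   /* Recorded by the child after waiting.  */

static void *
child_thread (void *arg)
{
  (void) arg;

  /* Wait until the parent says it has left the clone wrapper.
     Retry if the wait is interrupted by a signal.  */
  while (sem_wait (&parent_returned) == -1 && errno == EINTR)
    continue;

  child_saw_parent_out = parent_out_of_clone;

  /* A real test would raise its terminating signal here.  */
  return NULL;
}

/* Returns 0 if the child only proceeded after the parent had returned
   from pthread_create, 1 if not, -1 on setup failure.  */
int
run_synchronized_child (void)
{
  pthread_t thread;

  parent_out_of_clone = 0;
  child_saw_parent_out = 0;

  if (sem_init (&parent_returned, 0, 0) != 0)
    return -1;
  if (pthread_create (&thread, NULL, child_thread, NULL) != 0)
    {
      sem_destroy (&parent_returned);
      return -1;
    }

  /* pthread_create has returned, so this thread is out of the clone
     syscall wrapper.  Only now let the child continue.  */
  parent_out_of_clone = 1;
  int rc = sem_post (&parent_returned);
  assert (rc == 0);

  rc = pthread_join (thread, NULL);
  assert (rc == 0);
  sem_destroy (&parent_returned);

  return child_saw_parent_out ? 0 : 1;
}
```

This only guarantees that the parent has left the clone wrapper before
the child continues. Whether that is enough for the real test, and
where such a wait would fit in it, I haven't checked.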

Cheers,

Mark 
