On Sat, 30 May 2026 at 16:54, Branko Čibej <[email protected]> wrote:
> On 29. 5. 2026 12:53, Branko Čibej wrote:
>
> On 29. 5. 2026 12:16, Branko Čibej wrote:
>
> On 29. 5. 2026 08:50, Branko Čibej wrote:
>
> On 29. 5. 2026 00:17, Jun Omae wrote:
>
> Hi,
>
> On 2026/05/29 4:37, Branko Čibej wrote:
>
> This intermittent failure shows up sometimes on both Linux and macOS:
>
> FAIL: ra-test: Unknown test failure (-11); see tests.log.
>
> That's a crash, 11 is SIGSEGV on both OSes. Sometimes the same happens in
> repos-test, too, and it doesn't matter if it's local or DAV or svnserve. I
> suspect that we have a wonderful little bug in our code. I've been trying, on
> and off, to trace this down either with a debugger or with clang's address
> sanitiser or with APR's pool debugging enabled, but it's an elusive little
> bxtrd and I've not yet been able to track it down. Not even so far as to
> figure out whether it's hiding in FSFS or somewhere else.
>
> It's been a few years and I think I remember seeing it on the 1.14 branch,
> too. On the other hand, I don't recall any bug reports that could be linked
> to this observation. We're seeing this failure in the CI tests, too, at least
> with autotools. I haven't seen this with CMake, but failures in those
> workflows are far more often related to vcpkg or other environmental stuff,
> so it's a bit hard to find.
>
> If anyone has any idea where to look without diving into a line-by-line
> review of the code, please share your thoughts. This is starting to get on my
> nerves, just a little bit.
>
>
> The issue occurs in serf_default_destroy_and_data(), called via
> apr_pool_cleanup_for_exec() in the child process created by
> apr_proc_create() to open a tunnel, when the test runs multi-threaded. I
> assume that libsvn_ra_serf and/or serf are doing something wrong. Also, I
> guess that the same issue might occur when a hook is invoked with
> SVNMasterURI enabled on the Apache event or worker MPM.
>
>
> It happens without DAV, too, and with prefork mpm – the only one that even
> works on macOS, so I always use that for testing.
>
>
>
> This is what I caught on macOS. It was davautocheck, but Serf is not
> involved:
>
> (lldb) target create
> "/Volumes/svn-test/trunk/subversion/tests/libsvn_ra/.libs/ra-test" --core
> "/cores/core.9605"
> Core file '/cores/core.9605' (arm64) was loaded.
> (lldb) bt
> * thread #2, stop reason = ESR_EC_DABORT_EL0 (fault address: 0x0)
> * frame #0: 0x0000000104a83524
> ra-test`close_tunnel(tunnel_context=0x000000072d08e6b0,
> tunnel_baton=0x000000072d03c0a0) at ra-test.c:329:7
> frame #1: 0x0000000104d6b5f0
> libsvn_ra_svn-1.0.dylib`close_tunnel_cleanup(baton=0x000000072d08e258) at
> client.c:597:5
> frame #2: 0x00000001051fd1b4 libapr-1.0.dylib`apr_pool_destroy + 88
> frame #3: 0x00000001051fd1a0 libapr-1.0.dylib`apr_pool_destroy + 68
> frame #4: 0x0000000104b353a0
> libsvn_ra-1.0.dylib`svn_ra_open5(session_p=0x000000016b5b2ec0,
> corrected_url_p=0x0000000000000000, redirect_url_p=0x0000000000000000,
> repos_URL="svn+test://localhost/test-repo-tunnel", uuid=0x0000000000000000,
> callbacks=0x000000072d03c0d8, callback_baton=0x0000000000000000,
> config=0x0000000000000000, pool=0x000000072d040028) at ra_loader.c:395:7
> frame #5: 0x0000000104a7992c
> ra-test`tunnel_callback_test(opts=0x000000016b3826b8,
> pool=0x000000072d03c028) at ra-test.c:469:3
> frame #6: 0x0000000104aba8a0
> libsvn_test-1.0.dylib`test_thread(thread=0x000000072d01c150,
> data=0x000000016b382570) at svn_test_main.c:577:15
> frame #7: 0x0000000187549c58 libsystem_pthread.dylib`_pthread_start + 136
>
>
>
> Note that this test runs against svnserve with just 'check' too, no DAV
> involved at all. So it's either caused by something in libsvn_ra_svn, or
> more likely, in ra-test itself. Those tunnels are notoriously tricky, I
> remember we had trouble with them before in JavaHL, too.
>
>
> Finally got some interesting info from the core file – it takes some work
> to get core dumps on the Mac, setting ulimit -c isn't enough.
>
> frame #0: 0x0000000104a83524
> ra-test`close_tunnel(tunnel_context=0x000000072d08e6b0,
> tunnel_baton=0x000000072d03c0a0) at ra-test.c:329:7
> 326 apr_proc_wait(b->proc, &child_exit_code,
> &child_exit_why, APR_WAIT);
> 327
> 328 SVN_TEST_ASSERT_NO_RETURN(child_exit_status ==
> APR_CHILD_DONE);
> -> 329 SVN_TEST_ASSERT_NO_RETURN(child_exit_code == 0);
> 330 SVN_TEST_ASSERT_NO_RETURN(child_exit_why ==
> APR_PROC_EXIT);
> 331 }
> 332 }
> (lldb) p child_exit_status
> (apr_status_t) 70005
> (lldb) p child_exit_code
> (int) 10
> (lldb) p child_exit_why
> (apr_exit_why_e) APR_PROC_SIGNAL | APR_PROC_SIGNAL_CORE
>
>
>
> That means svnserve crashed with a SIGBUS. I'll keep investigating.
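
For anyone who wants to reproduce the reading of those values outside lldb, here is a minimal sketch in plain POSIX C (not APR) of how a parent sees a child that died from a signal. The helper names wait_for_self_signalled_child() and describe_wait_status() are made up for this illustration; the only assumption carried over from the dump above is that the "exit code" of a signalled child is the signal number.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdio.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that kills itself with `sig` and
   store the raw wait status in *status.  Returns 0 on success, -1 on error. */
static int wait_for_self_signalled_child(int sig, int *status)
{
    assert(status != NULL);

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        struct rlimit no_core = { 0, 0 };
        setrlimit(RLIMIT_CORE, &no_core); /* keep the demo from dumping core */
        signal(sig, SIG_DFL);             /* make sure the default action applies */
        raise(sig);
        _exit(0);                         /* only reached if the signal did not kill us */
    }

    if (waitpid(pid, status, 0) != pid)
        return -1;
    return 0;
}

/* Hypothetical helper: turn a wait status into a short description. */
static void describe_wait_status(int status, char *buf, size_t len)
{
    assert(buf != NULL && len > 0);

    if (WIFSIGNALED(status))
        snprintf(buf, len, "killed by signal %d", WTERMSIG(status));
    else if (WIFEXITED(status))
        snprintf(buf, len, "exited with code %d", WEXITSTATUS(status));
    else
        snprintf(buf, len, "unrecognised status %d", status);
}
```

Calling wait_for_self_signalled_child(SIGBUS, &status) and passing the result to describe_wait_status() reports the signal number rather than an exit code. On macOS SIGBUS is 10, which matches child_exit_code in the lldb output above.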
>
>
> The SIGBUS is real, but it's not in svnserve – it's in the forked ra-test
> and happens before the exec() of svnserve. I had to build apr and apr-util
> with debugging enabled to find this, and here's the relevant part of the
> stack. For context: On Unix, apr_proc_create() forks the parent then runs
> cleanups on the global pool before exec()'ing the real child.
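
The fork, cleanup, exec sequence described above can be sketched roughly as follows. This is a simplified stand-in written for this mail, not APR's code: toy_pool_t, cleanup_t, cleanup_for_exec(), spawn_after_cleanups() and count_cleanup() are all invented names, and the real pool structure has many more fields. The point is only to show that the child walks the whole pool tree before exec(), so one bad child/sibling link is enough to crash it.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Simplified stand-in for a registered cleanup; not APR's real type. */
typedef struct cleanup_t {
    struct cleanup_t *next;
    void (*child_cleanup)(void *data);
    void *data;
} cleanup_t;

/* Simplified stand-in for a pool: tree links plus a cleanup list. */
typedef struct toy_pool_t {
    struct toy_pool_t *child;
    struct toy_pool_t *sibling;
    cleanup_t *cleanups;
} toy_pool_t;

/* Example cleanup used for illustration: counts how often it runs. */
static void count_cleanup(void *data)
{
    ++*(int *)data;
}

/* Run and unlink every cleanup on one list. */
static void run_child_cleanups(cleanup_t **list)
{
    while (*list != NULL) {
        cleanup_t *c = *list;
        *list = c->next;            /* unlink first so nothing runs twice */
        c->child_cleanup(c->data);
    }
}

/* Walk the pool tree: this pool first, then each child recursively. */
static void cleanup_for_exec(toy_pool_t *p)
{
    run_child_cleanups(&p->cleanups);
    for (p = p->child; p != NULL; p = p->sibling)
        cleanup_for_exec(p);
}

/* Fork, run the cleanups in the child only, then exec the program.
   Returns the child's exit code, or -1 if it did not exit normally. */
static int spawn_after_cleanups(toy_pool_t *root, const char *path,
                                char *const argv[])
{
    assert(root != NULL && path != NULL && argv != NULL);

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        cleanup_for_exec(root);     /* a corrupted pool would fault here */
        execv(path, argv);
        _exit(127);                 /* exec failed */
    }

    int status = 0;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

In this sketch a crash during the tree walk shows up in the parent as -1, because the child is killed by a signal before it ever reaches execv(). That is the same shape as what the parent ra-test observed.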
>
>
> (lldb) down
> frame #3: 0x0000000101114b10
> libapr-1.0.dylib`cleanup_pool_for_exec(p=0x000000087ac8e028) at
> apr_pools.c:2702:9
> 2699 run_child_cleanups(&p->cleanups);
> 2700
> 2701 for (p = p->child; p; p = p->sibling)
> -> 2702 cleanup_pool_for_exec(p);
> 2703 }
> 2704
> 2705 APR_DECLARE(void) apr_pool_cleanup_for_exec(void)
> (lldb) p *p
> (apr_pool_t) {
> parent = 0x000000087ac44028
> child = 0x000000087a844028
> sibling = 0x000000087ac94028
> ref = 0x000000087ac44030
> cleanups = NULL
> free_cleanups = 0x000000087ac8e7e8
> allocator = 0x0000000101500b40
> subprocesses = NULL
> abort_fn = 0x0000000101018d94 (libsvn_subr-1.0.dylib`abort_on_pool_failure
> at pool.c:53)
> user_data = NULL
> tag = 0x0000000000000000
> active = 0x000000087a800000
> self = 0x000000087ac8e000
> self_first_avail = 0x000000087ac8e0a0 "file"
> pre_cleanups = NULL
> }
> (lldb) down
> frame #2: 0x0000000101114b10
> libapr-1.0.dylib`cleanup_pool_for_exec(p=0x000000087a844028) at
> apr_pools.c:2702:9
> 2699 run_child_cleanups(&p->cleanups);
> 2700
> 2701 for (p = p->child; p; p = p->sibling)
> -> 2702 cleanup_pool_for_exec(p);
> 2703 }
> 2704
> 2705 APR_DECLARE(void) apr_pool_cleanup_for_exec(void)
> (lldb) p *p
> (apr_pool_t) {
> parent = 0x0000000000000a35
> child = 0x0000000000000011
> sibling = 0x0000000000000002
> ref = 0x0000000060232b75
> cleanups = 0x0000000000000001
> free_cleanups = 0x0000000000000003
> allocator = 0x0000000000000011
> subprocesses = 0x0000000000000059
> abort_fn = 0x0000000000000005
> user_data = 0x00000000403dbe48
> tag = 0x0000000000000001 ""
> active = 0x0000000000000002
> self = 0x000000000000006a
> self_first_avail = 0x0000000000000001 ""
> pre_cleanups = 0x0000000000000006
> }
>
>
> Note the pool structure from frame #2: it's a clear case of memory
> corruption. Most of the pointers there are invalid, some are also
> misaligned. Calling run_child_cleanups(&p->cleanups) triggers an
> immediate SIGBUS because cleanups is misaligned. The fun part now will be
> trying to find out if this is a bug in APR pools – unlikely – or a memory
> corruption bug in ra-test or libsvn_ra – also unlikely but it has to be one
> of these.
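
As a rough illustration of why the frame #2 dump reads as corruption, here is a small heuristic in C. It is not an APR facility; plausible_pointer() and the 0x1000 threshold are assumptions made for this sketch (the first page is typically unmapped), and the required alignment is passed in because data and code pointers differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Heuristic only: could `value` be a valid pointer with the given
   alignment?  NULL is accepted because many pool fields may be NULL. */
static bool plausible_pointer(uint64_t value, unsigned alignment)
{
    assert(alignment != 0);

    if (value == 0)
        return true;                /* NULL is a legitimate field value */
    if (value % alignment != 0)
        return false;               /* misaligned for the pointed-to type */
    if (value < 0x1000)
        return false;               /* inside the (assumed) unmapped first page */
    return true;
}
```

With the values from the two dumps, free_cleanups = 0x000000087ac8e7e8 from frame #3 passes with an alignment of 8, while cleanups = 0x1 and child = 0x11 from frame #2 both fail. The check is deliberately weak: user_data = 0x00000000403dbe48 from the corrupted pool still passes, so it can flag garbage but cannot prove a pointer is good.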
>
> What finally pointed me in the right direction is that for every crash of
> ra-test, *two* core files were created. One from the parent which aborted
> because the child's exit code was non-zero, and one from the child. Well,
> at least it won't be boring.
>
Great detective work! Have we seen this on anything other than macOS? (Sorry,
I can't remember the details.)
It is a bit over my head, but give a shout if there is anything I can do to
help.
Cheers,
/Daniel