On Sun, 31 May 2026, 15:24 Daniel Sahlberg, <[email protected]> wrote:
> Den lör 30 maj 2026 kl 16:54 skrev Branko Čibej <[email protected]>: > >> On 29. 5. 2026 12:53, Branko Čibej wrote: >> >> On 29. 5. 2026 12:16, Branko Čibej wrote: >> >> On 29. 5. 2026 08:50, Branko Čibej wrote: >> >> On 29. 5. 2026 00:17, Jun Omae wrote: >> >> Hi, >> >> On 2026/05/29 4:37, Branko Čibej wrote: >> >> This intermittent failure shows up sometimes on both Linux and macOS: >> >> FAIL: ra-test: Unknown test failure (-11); see tests.log. >> >> That's a crash, 11 is SIGSEGV on both OSes. Sometimes the same happens in >> repos-test, too, and it doesn't matter if its local or DAV or svnserve. I >> suspect that we have a wonderful little bug in our code. I've been trying, >> on and off, to trace this down either with a debugger or with clang's >> address sanitiser or with APR's pool debugging enabled, but it's an elusive >> little bxtrd and I've not yet been able to track it down. Not even so far as >> to figure out whether it's hiding in FSFS or somewhere else. >> >> It's been a few years and I think I remember seeing it on the 1.14 branch, >> too. On the other hand, I don't recall any bug reports that could be linked >> to this observation. We're seeing this failure in the CI tests, too, at >> least with autotools. I haven't seen this with CMake, but failures in those >> workflows are far more often related to vcpkg or other environmental stuff, >> so it's a bit hard to find. >> >> If anyone has any idea where to look without diving into a line-by-line >> review of the code, please share your thoughts. This is starting to get on >> my nerves, just a little bit. >> >> >> The issue occurs in serf_default_destroy_and_data() via >> apr_pool_cleanup_for_exec() from the child process after apr_proc_create() >> to open a tunnel with multi-threaded. I assume that libsvn_ra_serf and/or >> serf have something wrong. Also, I guess that the same issue might occur >> when a hook is invoked with SVNMasterURI enabled on Apache mpm event or >> worker. >> >> >> It happens without DAV, too, and with prefork mpm – the only one that >> even works on macOS, so I always use that for testing. >> >> >> >> This is what I caught on macOS. It was davautocheck, but Serf is not >> involved: >> >> (lldb) target create >> "/Volumes/svn-test/trunk/subversion/tests/libsvn_ra/.libs/ra-test" --core >> "/cores/core.9605" >> Core file '/cores/core.9605' (arm64) was loaded. >> (lldb) bt >> * thread #2, stop reason = ESR_EC_DABORT_EL0 (fault address: 0x0) >> * frame #0: 0x0000000104a83524 >> ra-test`close_tunnel(tunnel_context=0x000000072d08e6b0, >> tunnel_baton=0x000000072d03c0a0) at ra-test.c:329:7 >> frame #1: 0x0000000104d6b5f0 >> libsvn_ra_svn-1.0.dylib`close_tunnel_cleanup(baton=0x000000072d08e258) at >> client.c:597:5 >> frame #2: 0x00000001051fd1b4 libapr-1.0.dylib`apr_pool_destroy + 88 >> frame #3: 0x00000001051fd1a0 libapr-1.0.dylib`apr_pool_destroy + 68 >> frame #4: 0x0000000104b353a0 >> libsvn_ra-1.0.dylib`svn_ra_open5(session_p=0x000000016b5b2ec0, >> corrected_url_p=0x0000000000000000, redirect_url_p=0x0000000000000000, >> repos_URL="svn+test://localhost/test-repo-tunnel", uuid=0x0000000000000000, >> callbacks=0x000000072d03c0d8, callback_baton=0x0000000000000000, >> config=0x0000000000000000, pool=0x000000072d040028) at ra_loader.c:395:7 >> frame #5: 0x0000000104a7992c >> ra-test`tunnel_callback_test(opts=0x000000016b3826b8, >> pool=0x000000072d03c028) at ra-test.c:469:3 >> frame #6: 0x0000000104aba8a0 >> libsvn_test-1.0.dylib`test_thread(thread=0x000000072d01c150, >> data=0x000000016b382570) at svn_test_main.c:577:15 >> frame #7: 0x0000000187549c58 libsystem_pthread.dylib`_pthread_start + 136 >> >> >> >> Note that this test runs against svnserve with just 'check' too, no DAV >> involved at all. So it's either caused by something in libsvn_ra_svn, or >> more likely, in ra-test itself. Those tunnels are notoriously tricky, I >> remember we had trouble with them before in JavaHL, too. >> >> >> Finally got some interesting info from the core file – it takes some work >> to get core dumps on the Mac, setting ulimit -c isn't enough. >> >> frame #0: 0x0000000104a83524 >> ra-test`close_tunnel(tunnel_context=0x000000072d08e6b0, >> tunnel_baton=0x000000072d03c0a0) at ra-test.c:329:7 >> 326 apr_proc_wait(b->proc, &child_exit_code, >> &child_exit_why, APR_WAIT); >> 327 >> 328 SVN_TEST_ASSERT_NO_RETURN(child_exit_status == >> APR_CHILD_DONE); >> -> 329 SVN_TEST_ASSERT_NO_RETURN(child_exit_code == 0); >> 330 SVN_TEST_ASSERT_NO_RETURN(child_exit_why == >> APR_PROC_EXIT); >> 331 } >> 332 } >> (lldb) p child_exit_status >> (apr_status_t) 70005 >> (lldb) p child_exit_code >> (int) 10 >> (lldb) p child_exit_why >> (apr_exit_why_e) APR_PROC_SIGNAL | APR_PROC_SIGNAL_CORE >> >> >> >> That means svnserve crashed with a SIGBUS. I'll keep investigating. >> >> >> The SIGBUS is real, but it's not in svnserve – it's in the forked ra-test >> and happens before the exec() of svnserve. I had to build apr and apr-util >> with debugging enabled to find this, and here's the relevant part of the >> stack. For context: On Unix, apr_proc_create() forks the parent then runs >> cleanups on the global pool before exec()'ing the real child. >> >> >> (lldb) down >> frame #3: 0x0000000101114b10 >> libapr-1.0.dylib`cleanup_pool_for_exec(p=0x000000087ac8e028) at >> apr_pools.c:2702:9 >> 2699 run_child_cleanups(&p->cleanups); >> 2700 >> 2701 for (p = p->child; p; p = p->sibling) >> -> 2702 cleanup_pool_for_exec(p); >> 2703 } >> 2704 >> 2705 APR_DECLARE(void) apr_pool_cleanup_for_exec(void) >> (lldb) p *p >> (apr_pool_t) { >> parent = 0x000000087ac44028 >> child = 0x000000087a844028 >> sibling = 0x000000087ac94028 >> ref = 0x000000087ac44030 >> cleanups = NULL >> free_cleanups = 0x000000087ac8e7e8 >> allocator = 0x0000000101500b40 >> subprocesses = NULL >> abort_fn = 0x0000000101018d94 (libsvn_subr-1.0.dylib`abort_on_pool_failure >> at pool.c:53) >> user_data = NULL >> tag = 0x0000000000000000 >> active = 0x000000087a800000 >> self = 0x000000087ac8e000 >> self_first_avail = 0x000000087ac8e0a0 "file" >> pre_cleanups = NULL >> } >> (lldb) down >> frame #2: 0x0000000101114b10 >> libapr-1.0.dylib`cleanup_pool_for_exec(p=0x000000087a844028) at >> apr_pools.c:2702:9 >> 2699 run_child_cleanups(&p->cleanups); >> 2700 >> 2701 for (p = p->child; p; p = p->sibling) >> -> 2702 cleanup_pool_for_exec(p); >> 2703 } >> 2704 >> 2705 APR_DECLARE(void) apr_pool_cleanup_for_exec(void) >> (lldb) p *p >> (apr_pool_t) { >> parent = 0x0000000000000a35 >> child = 0x0000000000000011 >> sibling = 0x0000000000000002 >> ref = 0x0000000060232b75 >> cleanups = 0x0000000000000001 >> free_cleanups = 0x0000000000000003 >> allocator = 0x0000000000000011 >> subprocesses = 0x0000000000000059 >> abort_fn = 0x0000000000000005 >> user_data = 0x00000000403dbe48 >> tag = 0x0000000000000001 "" >> active = 0x0000000000000002 >> self = 0x000000000000006a >> self_first_avail = 0x0000000000000001 "" >> pre_cleanups = 0x0000000000000006 >> } >> >> >> Note he pool structure from frame #2: it's a clear case of memory >> corruption. Most of the pointers there are invalid, some are also >> misaligned. Calling run_child_cleanups(&p->cleanups) triggers an >> immediate SIGBUS because cleanups is misaligned. The fun part now will >> be trying to find out if this is a bug in APR pools – unlikely – or a >> memory corruption bug in ra-test or libsvn_ra – also unlikely but it has to >> be one of these. >> >> What finally pointed me in the right direction is that for every crash of >> ra-test, *two* core files were created. One from the parent which >> aborted because the child's exit code was non-zero, and one from the child. >> Well, at least it won't be boring. >> > > Great detective work! Have we seen this on anything other than macOS > (sorry I can't remember the details)? > > It is a bit over my head, but shout out if there is anything I can do to > help. > We see intermittent failures in the Linux CI, but I don't know if it's the same issue or something else. I could only reproduce this when running tests in parallel. -- Brane

