Re: Failing ra-test

Branko Čibej Mon, 01 Jun 2026 02:42:39 -0700

On Sun, 31 May 2026, 16:07 Branko Čibej, <[email protected]> wrote:

> On Sun, 31 May 2026, 15:24 Daniel Sahlberg, <[email protected]>
> wrote:
>
>> Den lör 30 maj 2026 kl 16:54 skrev Branko Čibej <[email protected]>:
>>
>>> On 29. 5. 2026 12:53, Branko Čibej wrote:
>>>
>>> On 29. 5. 2026 12:16, Branko Čibej wrote:
>>>
>>> On 29. 5. 2026 08:50, Branko Čibej wrote:
>>>
>>> On 29. 5. 2026 00:17, Jun Omae wrote:
>>>
>>> Hi,
>>>
>>> On 2026/05/29 4:37, Branko Čibej wrote:
>>>
>>> This intermittent failure shows up sometimes on both Linux and macOS:
>>>
>>> FAIL:  ra-test: Unknown test failure (-11); see tests.log.
>>>
>>> That's a crash, 11 is SIGSEGV on both OSes. Sometimes the same happens in 
>>> repos-test, too, and it doesn't matter if its local or DAV or svnserve. I 
>>> suspect that we have a wonderful little bug in our code. I've been trying, 
>>> on and off, to trace this down either with a debugger or with clang's 
>>> address sanitiser or with APR's pool debugging enabled, but it's an elusive 
>>> little bxtrd and I've not yet been able to track it down. Not even so far 
>>> as to figure out whether it's hiding in FSFS or somewhere else.
>>>
>>> It's been a few years and I think I remember seeing it on the 1.14 branch, 
>>> too. On the other hand, I don't recall any bug reports that could be linked 
>>> to this observation. We're seeing this failure in the CI tests, too, at 
>>> least with autotools. I haven't seen this with CMake, but failures in those 
>>> workflows are far more often related to vcpkg or other environmental stuff, 
>>> so it's a bit hard to find.
>>>
>>> If anyone has any idea where to look without diving into a line-by-line 
>>> review of the code, please share your thoughts. This is starting to get on 
>>> my nerves, just a little bit.
>>>
>>>
>>> The issue occurs in serf_default_destroy_and_data() via 
>>> apr_pool_cleanup_for_exec() from the child process after apr_proc_create() 
>>> to open a tunnel with multi-threaded. I assume that libsvn_ra_serf and/or 
>>> serf have something wrong. Also, I guess that the same issue might occur 
>>> when a hook is invoked with SVNMasterURI enabled on Apache mpm event or 
>>> worker.
>>>
>>>
>>> It happens without DAV, too, and with prefork mpm – the only one that
>>> even works on macOS, so I always use that for testing.
>>>
>>>
>>>
>>> This is what I caught on macOS. It was davautocheck, but Serf is not
>>> involved:
>>>
>>> (lldb) target create 
>>> "/Volumes/svn-test/trunk/subversion/tests/libsvn_ra/.libs/ra-test" --core 
>>> "/cores/core.9605"
>>> Core file '/cores/core.9605' (arm64) was loaded.
>>> (lldb) bt
>>> * thread #2, stop reason = ESR_EC_DABORT_EL0 (fault address: 0x0)
>>>   * frame #0: 0x0000000104a83524 
>>> ra-test`close_tunnel(tunnel_context=0x000000072d08e6b0, 
>>> tunnel_baton=0x000000072d03c0a0) at ra-test.c:329:7
>>>     frame #1: 0x0000000104d6b5f0 
>>> libsvn_ra_svn-1.0.dylib`close_tunnel_cleanup(baton=0x000000072d08e258) at 
>>> client.c:597:5
>>>     frame #2: 0x00000001051fd1b4 libapr-1.0.dylib`apr_pool_destroy + 88
>>>     frame #3: 0x00000001051fd1a0 libapr-1.0.dylib`apr_pool_destroy + 68
>>>     frame #4: 0x0000000104b353a0 
>>> libsvn_ra-1.0.dylib`svn_ra_open5(session_p=0x000000016b5b2ec0, 
>>> corrected_url_p=0x0000000000000000, redirect_url_p=0x0000000000000000, 
>>> repos_URL="svn+test://localhost/test-repo-tunnel", uuid=0x0000000000000000, 
>>> callbacks=0x000000072d03c0d8, callback_baton=0x0000000000000000, 
>>> config=0x0000000000000000, pool=0x000000072d040028) at ra_loader.c:395:7
>>>     frame #5: 0x0000000104a7992c 
>>> ra-test`tunnel_callback_test(opts=0x000000016b3826b8, 
>>> pool=0x000000072d03c028) at ra-test.c:469:3
>>>     frame #6: 0x0000000104aba8a0 
>>> libsvn_test-1.0.dylib`test_thread(thread=0x000000072d01c150, 
>>> data=0x000000016b382570) at svn_test_main.c:577:15
>>>     frame #7: 0x0000000187549c58 libsystem_pthread.dylib`_pthread_start + 
>>> 136
>>>
>>>
>>>
>>> Note that this test runs against svnserve with just 'check' too, no DAV
>>> involved at all. So it's either caused by something in libsvn_ra_svn, or
>>> more likely, in ra-test itself. Those tunnels are notoriously tricky, I
>>> remember we had trouble with them before in JavaHL, too.
>>>
>>>
>>> Finally got some interesting info from the core file – it takes some
>>> work to get core dumps on the Mac, setting ulimit -c isn't enough.
>>>
>>> frame #0: 0x0000000104a83524 
>>> ra-test`close_tunnel(tunnel_context=0x000000072d08e6b0, 
>>> tunnel_baton=0x000000072d03c0a0) at ra-test.c:329:7
>>>    326              apr_proc_wait(b->proc, &child_exit_code, 
>>> &child_exit_why, APR_WAIT);
>>>    327      
>>>    328            SVN_TEST_ASSERT_NO_RETURN(child_exit_status == 
>>> APR_CHILD_DONE);
>>> -> 329            SVN_TEST_ASSERT_NO_RETURN(child_exit_code == 0);
>>>    330            SVN_TEST_ASSERT_NO_RETURN(child_exit_why == 
>>> APR_PROC_EXIT);
>>>    331          }
>>>    332      }
>>> (lldb) p child_exit_status
>>> (apr_status_t) 70005
>>> (lldb) p child_exit_code
>>> (int) 10
>>> (lldb) p child_exit_why
>>> (apr_exit_why_e) APR_PROC_SIGNAL | APR_PROC_SIGNAL_CORE
>>>
>>>
>>>
>>> That means svnserve crashed with a SIGBUS. I'll keep investigating.
>>>
>>>
>>> The SIGBUS is real, but it's not in svnserve – it's in the forked
>>> ra-test and happens before the exec() of svnserve. I had to build apr and
>>> apr-util with debugging enabled to find this, and here's the relevant part
>>> of the stack. For context: On Unix, apr_proc_create() forks the parent then
>>> runs cleanups on the global pool before exec()'ing the real child.
>>>
>>>
>>> (lldb) down
>>> frame #3: 0x0000000101114b10 
>>> libapr-1.0.dylib`cleanup_pool_for_exec(p=0x000000087ac8e028) at 
>>> apr_pools.c:2702:9
>>>    2699         run_child_cleanups(&p->cleanups);
>>>    2700     
>>>    2701         for (p = p->child; p; p = p->sibling)
>>> -> 2702             cleanup_pool_for_exec(p);
>>>    2703     }
>>>    2704     
>>>    2705     APR_DECLARE(void) apr_pool_cleanup_for_exec(void)
>>> (lldb) p *p
>>> (apr_pool_t) {
>>>   parent = 0x000000087ac44028
>>>   child = 0x000000087a844028
>>>   sibling = 0x000000087ac94028
>>>   ref = 0x000000087ac44030
>>>   cleanups = NULL
>>>   free_cleanups = 0x000000087ac8e7e8
>>>   allocator = 0x0000000101500b40
>>>   subprocesses = NULL
>>>   abort_fn = 0x0000000101018d94 
>>> (libsvn_subr-1.0.dylib`abort_on_pool_failure at pool.c:53)
>>>   user_data = NULL
>>>   tag = 0x0000000000000000
>>>   active = 0x000000087a800000
>>>   self = 0x000000087ac8e000
>>>   self_first_avail = 0x000000087ac8e0a0 "file"
>>>   pre_cleanups = NULL
>>> }
>>> (lldb) down
>>> frame #2: 0x0000000101114b10 
>>> libapr-1.0.dylib`cleanup_pool_for_exec(p=0x000000087a844028) at 
>>> apr_pools.c:2702:9
>>>    2699         run_child_cleanups(&p->cleanups);
>>>    2700     
>>>    2701         for (p = p->child; p; p = p->sibling)
>>> -> 2702             cleanup_pool_for_exec(p);
>>>    2703     }
>>>    2704     
>>>    2705     APR_DECLARE(void) apr_pool_cleanup_for_exec(void)
>>> (lldb) p *p
>>> (apr_pool_t) {
>>>   parent = 0x0000000000000a35
>>>   child = 0x0000000000000011
>>>   sibling = 0x0000000000000002
>>>   ref = 0x0000000060232b75
>>>   cleanups = 0x0000000000000001
>>>   free_cleanups = 0x0000000000000003
>>>   allocator = 0x0000000000000011
>>>   subprocesses = 0x0000000000000059
>>>   abort_fn = 0x0000000000000005
>>>   user_data = 0x00000000403dbe48
>>>   tag = 0x0000000000000001 ""
>>>   active = 0x0000000000000002
>>>   self = 0x000000000000006a
>>>   self_first_avail = 0x0000000000000001 ""
>>>   pre_cleanups = 0x0000000000000006
>>> }
>>>
>>>
>>> Note he pool structure from frame #2: it's a clear case of memory
>>> corruption. Most of the pointers there are invalid, some are also
>>> misaligned. Calling run_child_cleanups(&p->cleanups) triggers an
>>> immediate SIGBUS because cleanups is misaligned. The fun part now will
>>> be trying to find out if this is a bug in APR pools – unlikely – or a
>>> memory corruption bug in ra-test or libsvn_ra – also unlikely but it has to
>>> be one of these.
>>>
>>> What finally pointed me in the right direction is that for every crash
>>> of ra-test, *two* core files were created. One from the parent which
>>> aborted because the child's exit code was non-zero, and one from the child.
>>> Well, at least it won't be boring.
>>>
>>
>> Great detective work! Have we seen this on anything other than macOS
>> (sorry I can't remember the details)?
>>
>> It is a bit over my head, but shout out if there is anything I can do to
>> help.
>>
>
> We see intermittent failures in the Linux CI, but I don't know if it's the
> same issue or something else.
>
> I could only reproduce this when running tests in parallel.
>



I should expand on this a bit. ra-test limits the number of concurrent
threads to 4, and I saw far less than 1% of test runs fail. But I have 16
cores on this laptop and when I increased the thread limit to 16, the
failure rate was closer to 30%. Definitely something related to pools and
concurrency.

I've looked carefully through the APR pool code and from what I could see
it's thread safe as long as the allocator has a mutex. Our C test driver
however creates a separate allocator for each thread that does not have a
mutex. This should be safe because access to that allocator and its pool is
single-threaded.

But after fork(), the child process runs all child cleanups from the global
pool downwards. That access is single threaded too, but it's possible that
some interaction between the global pool's mutex and fork() leaves part of
the pool hierarchy only partially initialised. That would be ... unexpected
to say the least, but it's the only hypothesis I have for now.

I'll try removing the "optimisation" that creates per-thread allocators
without mutexes, I think it's uncalled for and these are tests anyway. We
won't have all that much contention on that mutex – it's only used when
pools are created and destroyed, not when they're cleared or allocated from.

-- Brane

Re: Failing ra-test

Reply via email to