Re: Failing ra-test

Branko Čibej Sun, 31 May 2026 07:07:45 -0700

On Sun, 31 May 2026, 15:24 Daniel Sahlberg, <[email protected]>
wrote:


> Den lör 30 maj 2026 kl 16:54 skrev Branko Čibej <[email protected]>:
>
>> On 29. 5. 2026 12:53, Branko Čibej wrote:
>>
>> On 29. 5. 2026 12:16, Branko Čibej wrote:
>>
>> On 29. 5. 2026 08:50, Branko Čibej wrote:
>>
>> On 29. 5. 2026 00:17, Jun Omae wrote:
>>
>> Hi,
>>
>> On 2026/05/29 4:37, Branko Čibej wrote:
>>
>> This intermittent failure shows up sometimes on both Linux and macOS:
>>
>> FAIL:  ra-test: Unknown test failure (-11); see tests.log.
>>
>> That's a crash, 11 is SIGSEGV on both OSes. Sometimes the same happens in 
>> repos-test, too, and it doesn't matter if its local or DAV or svnserve. I 
>> suspect that we have a wonderful little bug in our code. I've been trying, 
>> on and off, to trace this down either with a debugger or with clang's 
>> address sanitiser or with APR's pool debugging enabled, but it's an elusive 
>> little bxtrd and I've not yet been able to track it down. Not even so far as 
>> to figure out whether it's hiding in FSFS or somewhere else.
>>
>> It's been a few years and I think I remember seeing it on the 1.14 branch, 
>> too. On the other hand, I don't recall any bug reports that could be linked 
>> to this observation. We're seeing this failure in the CI tests, too, at 
>> least with autotools. I haven't seen this with CMake, but failures in those 
>> workflows are far more often related to vcpkg or other environmental stuff, 
>> so it's a bit hard to find.
>>
>> If anyone has any idea where to look without diving into a line-by-line 
>> review of the code, please share your thoughts. This is starting to get on 
>> my nerves, just a little bit.
>>
>>
>> The issue occurs in serf_default_destroy_and_data() via 
>> apr_pool_cleanup_for_exec() from the child process after apr_proc_create() 
>> to open a tunnel with multi-threaded. I assume that libsvn_ra_serf and/or 
>> serf have something wrong. Also, I guess that the same issue might occur 
>> when a hook is invoked with SVNMasterURI enabled on Apache mpm event or 
>> worker.
>>
>>
>> It happens without DAV, too, and with prefork mpm – the only one that
>> even works on macOS, so I always use that for testing.
>>
>>
>>
>> This is what I caught on macOS. It was davautocheck, but Serf is not
>> involved:
>>
>> (lldb) target create 
>> "/Volumes/svn-test/trunk/subversion/tests/libsvn_ra/.libs/ra-test" --core 
>> "/cores/core.9605"
>> Core file '/cores/core.9605' (arm64) was loaded.
>> (lldb) bt
>> * thread #2, stop reason = ESR_EC_DABORT_EL0 (fault address: 0x0)
>>   * frame #0: 0x0000000104a83524 
>> ra-test`close_tunnel(tunnel_context=0x000000072d08e6b0, 
>> tunnel_baton=0x000000072d03c0a0) at ra-test.c:329:7
>>     frame #1: 0x0000000104d6b5f0 
>> libsvn_ra_svn-1.0.dylib`close_tunnel_cleanup(baton=0x000000072d08e258) at 
>> client.c:597:5
>>     frame #2: 0x00000001051fd1b4 libapr-1.0.dylib`apr_pool_destroy + 88
>>     frame #3: 0x00000001051fd1a0 libapr-1.0.dylib`apr_pool_destroy + 68
>>     frame #4: 0x0000000104b353a0 
>> libsvn_ra-1.0.dylib`svn_ra_open5(session_p=0x000000016b5b2ec0, 
>> corrected_url_p=0x0000000000000000, redirect_url_p=0x0000000000000000, 
>> repos_URL="svn+test://localhost/test-repo-tunnel", uuid=0x0000000000000000, 
>> callbacks=0x000000072d03c0d8, callback_baton=0x0000000000000000, 
>> config=0x0000000000000000, pool=0x000000072d040028) at ra_loader.c:395:7
>>     frame #5: 0x0000000104a7992c 
>> ra-test`tunnel_callback_test(opts=0x000000016b3826b8, 
>> pool=0x000000072d03c028) at ra-test.c:469:3
>>     frame #6: 0x0000000104aba8a0 
>> libsvn_test-1.0.dylib`test_thread(thread=0x000000072d01c150, 
>> data=0x000000016b382570) at svn_test_main.c:577:15
>>     frame #7: 0x0000000187549c58 libsystem_pthread.dylib`_pthread_start + 136
>>
>>
>>
>> Note that this test runs against svnserve with just 'check' too, no DAV
>> involved at all. So it's either caused by something in libsvn_ra_svn, or
>> more likely, in ra-test itself. Those tunnels are notoriously tricky, I
>> remember we had trouble with them before in JavaHL, too.
>>
>>
>> Finally got some interesting info from the core file – it takes some work
>> to get core dumps on the Mac, setting ulimit -c isn't enough.
>>
>> frame #0: 0x0000000104a83524 
>> ra-test`close_tunnel(tunnel_context=0x000000072d08e6b0, 
>> tunnel_baton=0x000000072d03c0a0) at ra-test.c:329:7
>>    326               apr_proc_wait(b->proc, &child_exit_code, 
>> &child_exit_why, APR_WAIT);
>>    327       
>>    328             SVN_TEST_ASSERT_NO_RETURN(child_exit_status == 
>> APR_CHILD_DONE);
>> -> 329             SVN_TEST_ASSERT_NO_RETURN(child_exit_code == 0);
>>    330             SVN_TEST_ASSERT_NO_RETURN(child_exit_why == 
>> APR_PROC_EXIT);
>>    331           }
>>    332       }
>> (lldb) p child_exit_status
>> (apr_status_t) 70005
>> (lldb) p child_exit_code
>> (int) 10
>> (lldb) p child_exit_why
>> (apr_exit_why_e) APR_PROC_SIGNAL | APR_PROC_SIGNAL_CORE
>>
>>
>>
>> That means svnserve crashed with a SIGBUS. I'll keep investigating.
>>
>>
>> The SIGBUS is real, but it's not in svnserve – it's in the forked ra-test
>> and happens before the exec() of svnserve. I had to build apr and apr-util
>> with debugging enabled to find this, and here's the relevant part of the
>> stack. For context: On Unix, apr_proc_create() forks the parent then runs
>> cleanups on the global pool before exec()'ing the real child.
>>
>>
>> (lldb) down
>> frame #3: 0x0000000101114b10 
>> libapr-1.0.dylib`cleanup_pool_for_exec(p=0x000000087ac8e028) at 
>> apr_pools.c:2702:9
>>    2699          run_child_cleanups(&p->cleanups);
>>    2700      
>>    2701          for (p = p->child; p; p = p->sibling)
>> -> 2702              cleanup_pool_for_exec(p);
>>    2703      }
>>    2704      
>>    2705      APR_DECLARE(void) apr_pool_cleanup_for_exec(void)
>> (lldb) p *p
>> (apr_pool_t) {
>>   parent = 0x000000087ac44028
>>   child = 0x000000087a844028
>>   sibling = 0x000000087ac94028
>>   ref = 0x000000087ac44030
>>   cleanups = NULL
>>   free_cleanups = 0x000000087ac8e7e8
>>   allocator = 0x0000000101500b40
>>   subprocesses = NULL
>>   abort_fn = 0x0000000101018d94 (libsvn_subr-1.0.dylib`abort_on_pool_failure 
>> at pool.c:53)
>>   user_data = NULL
>>   tag = 0x0000000000000000
>>   active = 0x000000087a800000
>>   self = 0x000000087ac8e000
>>   self_first_avail = 0x000000087ac8e0a0 "file"
>>   pre_cleanups = NULL
>> }
>> (lldb) down
>> frame #2: 0x0000000101114b10 
>> libapr-1.0.dylib`cleanup_pool_for_exec(p=0x000000087a844028) at 
>> apr_pools.c:2702:9
>>    2699          run_child_cleanups(&p->cleanups);
>>    2700      
>>    2701          for (p = p->child; p; p = p->sibling)
>> -> 2702              cleanup_pool_for_exec(p);
>>    2703      }
>>    2704      
>>    2705      APR_DECLARE(void) apr_pool_cleanup_for_exec(void)
>> (lldb) p *p
>> (apr_pool_t) {
>>   parent = 0x0000000000000a35
>>   child = 0x0000000000000011
>>   sibling = 0x0000000000000002
>>   ref = 0x0000000060232b75
>>   cleanups = 0x0000000000000001
>>   free_cleanups = 0x0000000000000003
>>   allocator = 0x0000000000000011
>>   subprocesses = 0x0000000000000059
>>   abort_fn = 0x0000000000000005
>>   user_data = 0x00000000403dbe48
>>   tag = 0x0000000000000001 ""
>>   active = 0x0000000000000002
>>   self = 0x000000000000006a
>>   self_first_avail = 0x0000000000000001 ""
>>   pre_cleanups = 0x0000000000000006
>> }
>>
>>
>> Note he pool structure from frame #2: it's a clear case of memory
>> corruption. Most of the pointers there are invalid, some are also
>> misaligned. Calling run_child_cleanups(&p->cleanups) triggers an
>> immediate SIGBUS because cleanups is misaligned. The fun part now will
>> be trying to find out if this is a bug in APR pools – unlikely – or a
>> memory corruption bug in ra-test or libsvn_ra – also unlikely but it has to
>> be one of these.
>>
>> What finally pointed me in the right direction is that for every crash of
>> ra-test, *two* core files were created. One from the parent which
>> aborted because the child's exit code was non-zero, and one from the child.
>> Well, at least it won't be boring.
>>
>
> Great detective work! Have we seen this on anything other than macOS
> (sorry I can't remember the details)?
>
> It is a bit over my head, but shout out if there is anything I can do to
> help.
>

We see intermittent failures in the Linux CI, but I don't know if it's the
same issue or something else.

I could only reproduce this when running tests in parallel.

-- Brane

Re: Failing ra-test

Reply via email to