Den lör 30 maj 2026 kl 16:54 skrev Branko Čibej <[email protected]>:

> On 29. 5. 2026 12:53, Branko Čibej wrote:
>
> On 29. 5. 2026 12:16, Branko Čibej wrote:
>
> On 29. 5. 2026 08:50, Branko Čibej wrote:
>
> On 29. 5. 2026 00:17, Jun Omae wrote:
>
> Hi,
>
> On 2026/05/29 4:37, Branko Čibej wrote:
>
> This intermittent failure shows up sometimes on both Linux and macOS:
>
> FAIL:  ra-test: Unknown test failure (-11); see tests.log.
>
> That's a crash, 11 is SIGSEGV on both OSes. Sometimes the same happens in 
> repos-test, too, and it doesn't matter if its local or DAV or svnserve. I 
> suspect that we have a wonderful little bug in our code. I've been trying, on 
> and off, to trace this down either with a debugger or with clang's address 
> sanitiser or with APR's pool debugging enabled, but it's an elusive little 
> bxtrd and I've not yet been able to track it down. Not even so far as to 
> figure out whether it's hiding in FSFS or somewhere else.
>
> It's been a few years and I think I remember seeing it on the 1.14 branch, 
> too. On the other hand, I don't recall any bug reports that could be linked 
> to this observation. We're seeing this failure in the CI tests, too, at least 
> with autotools. I haven't seen this with CMake, but failures in those 
> workflows are far more often related to vcpkg or other environmental stuff, 
> so it's a bit hard to find.
>
> If anyone has any idea where to look without diving into a line-by-line 
> review of the code, please share your thoughts. This is starting to get on my 
> nerves, just a little bit.
>
>
> The issue occurs in serf_default_destroy_and_data() via 
> apr_pool_cleanup_for_exec() from the child process after apr_proc_create() to 
> open a tunnel with multi-threaded. I assume that libsvn_ra_serf and/or serf 
> have something wrong. Also, I guess that the same issue might occur when a 
> hook is invoked with SVNMasterURI enabled on Apache mpm event or worker.
>
>
> It happens without DAV, too, and with prefork mpm – the only one that even
> works on macOS, so I always use that for testing.
>
>
>
> This is what I caught on macOS. It was davautocheck, but Serf is not
> involved:
>
> (lldb) target create 
> "/Volumes/svn-test/trunk/subversion/tests/libsvn_ra/.libs/ra-test" --core 
> "/cores/core.9605"
> Core file '/cores/core.9605' (arm64) was loaded.
> (lldb) bt
> * thread #2, stop reason = ESR_EC_DABORT_EL0 (fault address: 0x0)
>   * frame #0: 0x0000000104a83524 
> ra-test`close_tunnel(tunnel_context=0x000000072d08e6b0, 
> tunnel_baton=0x000000072d03c0a0) at ra-test.c:329:7
>     frame #1: 0x0000000104d6b5f0 
> libsvn_ra_svn-1.0.dylib`close_tunnel_cleanup(baton=0x000000072d08e258) at 
> client.c:597:5
>     frame #2: 0x00000001051fd1b4 libapr-1.0.dylib`apr_pool_destroy + 88
>     frame #3: 0x00000001051fd1a0 libapr-1.0.dylib`apr_pool_destroy + 68
>     frame #4: 0x0000000104b353a0 
> libsvn_ra-1.0.dylib`svn_ra_open5(session_p=0x000000016b5b2ec0, 
> corrected_url_p=0x0000000000000000, redirect_url_p=0x0000000000000000, 
> repos_URL="svn+test://localhost/test-repo-tunnel", uuid=0x0000000000000000, 
> callbacks=0x000000072d03c0d8, callback_baton=0x0000000000000000, 
> config=0x0000000000000000, pool=0x000000072d040028) at ra_loader.c:395:7
>     frame #5: 0x0000000104a7992c 
> ra-test`tunnel_callback_test(opts=0x000000016b3826b8, 
> pool=0x000000072d03c028) at ra-test.c:469:3
>     frame #6: 0x0000000104aba8a0 
> libsvn_test-1.0.dylib`test_thread(thread=0x000000072d01c150, 
> data=0x000000016b382570) at svn_test_main.c:577:15
>     frame #7: 0x0000000187549c58 libsystem_pthread.dylib`_pthread_start + 136
>
>
>
> Note that this test runs against svnserve with just 'check' too, no DAV
> involved at all. So it's either caused by something in libsvn_ra_svn, or
> more likely, in ra-test itself. Those tunnels are notoriously tricky, I
> remember we had trouble with them before in JavaHL, too.
>
>
> Finally got some interesting info from the core file – it takes some work
> to get core dumps on the Mac, setting ulimit -c isn't enough.
>
> frame #0: 0x0000000104a83524 
> ra-test`close_tunnel(tunnel_context=0x000000072d08e6b0, 
> tunnel_baton=0x000000072d03c0a0) at ra-test.c:329:7
>    326                apr_proc_wait(b->proc, &child_exit_code, 
> &child_exit_why, APR_WAIT);
>    327        
>    328              SVN_TEST_ASSERT_NO_RETURN(child_exit_status == 
> APR_CHILD_DONE);
> -> 329              SVN_TEST_ASSERT_NO_RETURN(child_exit_code == 0);
>    330              SVN_TEST_ASSERT_NO_RETURN(child_exit_why == 
> APR_PROC_EXIT);
>    331            }
>    332        }
> (lldb) p child_exit_status
> (apr_status_t) 70005
> (lldb) p child_exit_code
> (int) 10
> (lldb) p child_exit_why
> (apr_exit_why_e) APR_PROC_SIGNAL | APR_PROC_SIGNAL_CORE
>
>
>
> That means svnserve crashed with a SIGBUS. I'll keep investigating.
>
>
> The SIGBUS is real, but it's not in svnserve – it's in the forked ra-test
> and happens before the exec() of svnserve. I had to build apr and apr-util
> with debugging enabled to find this, and here's the relevant part of the
> stack. For context: On Unix, apr_proc_create() forks the parent then runs
> cleanups on the global pool before exec()'ing the real child.
>
>
> (lldb) down
> frame #3: 0x0000000101114b10 
> libapr-1.0.dylib`cleanup_pool_for_exec(p=0x000000087ac8e028) at 
> apr_pools.c:2702:9
>    2699           run_child_cleanups(&p->cleanups);
>    2700       
>    2701           for (p = p->child; p; p = p->sibling)
> -> 2702               cleanup_pool_for_exec(p);
>    2703       }
>    2704       
>    2705       APR_DECLARE(void) apr_pool_cleanup_for_exec(void)
> (lldb) p *p
> (apr_pool_t) {
>   parent = 0x000000087ac44028
>   child = 0x000000087a844028
>   sibling = 0x000000087ac94028
>   ref = 0x000000087ac44030
>   cleanups = NULL
>   free_cleanups = 0x000000087ac8e7e8
>   allocator = 0x0000000101500b40
>   subprocesses = NULL
>   abort_fn = 0x0000000101018d94 (libsvn_subr-1.0.dylib`abort_on_pool_failure 
> at pool.c:53)
>   user_data = NULL
>   tag = 0x0000000000000000
>   active = 0x000000087a800000
>   self = 0x000000087ac8e000
>   self_first_avail = 0x000000087ac8e0a0 "file"
>   pre_cleanups = NULL
> }
> (lldb) down
> frame #2: 0x0000000101114b10 
> libapr-1.0.dylib`cleanup_pool_for_exec(p=0x000000087a844028) at 
> apr_pools.c:2702:9
>    2699           run_child_cleanups(&p->cleanups);
>    2700       
>    2701           for (p = p->child; p; p = p->sibling)
> -> 2702               cleanup_pool_for_exec(p);
>    2703       }
>    2704       
>    2705       APR_DECLARE(void) apr_pool_cleanup_for_exec(void)
> (lldb) p *p
> (apr_pool_t) {
>   parent = 0x0000000000000a35
>   child = 0x0000000000000011
>   sibling = 0x0000000000000002
>   ref = 0x0000000060232b75
>   cleanups = 0x0000000000000001
>   free_cleanups = 0x0000000000000003
>   allocator = 0x0000000000000011
>   subprocesses = 0x0000000000000059
>   abort_fn = 0x0000000000000005
>   user_data = 0x00000000403dbe48
>   tag = 0x0000000000000001 ""
>   active = 0x0000000000000002
>   self = 0x000000000000006a
>   self_first_avail = 0x0000000000000001 ""
>   pre_cleanups = 0x0000000000000006
> }
>
>
> Note he pool structure from frame #2: it's a clear case of memory
> corruption. Most of the pointers there are invalid, some are also
> misaligned. Calling run_child_cleanups(&p->cleanups) triggers an
> immediate SIGBUS because cleanups is misaligned. The fun part now will be
> trying to find out if this is a bug in APR pools – unlikely – or a memory
> corruption bug in ra-test or libsvn_ra – also unlikely but it has to be one
> of these.
>
> What finally pointed me in the right direction is that for every crash of
> ra-test, *two* core files were created. One from the parent which aborted
> because the child's exit code was non-zero, and one from the child. Well,
> at least it won't be boring.
>

Great detective work! Have we seen this on anything other than macOS (sorry
I can't remember the details)?

It is a bit over my head, but shout out if there is anything I can do to
help.

Cheers,
/Daniel

Reply via email to