Re: Failing ra-test

Branko Čibej Thu, 04 Jun 2026 01:07:52 -0700

On 1. 6. 2026 11:42, Branko Čibej wrote:

On Sun, 31 May 2026, 16:07 Branko Čibej, <[email protected]> wrote:


    On Sun, 31 May 2026, 15:24 Daniel Sahlberg,
    <[email protected]> wrote:

        Den lör 30 maj 2026 kl 16:54 skrev Branko Čibej
        <[email protected]>:

            On 29. 5. 2026 12:53, Branko Čibej wrote:

            On 29. 5. 2026 12:16, Branko Čibej wrote:

            On 29. 5. 2026 08:50, Branko Čibej wrote:

            On 29. 5. 2026 00:17, Jun Omae wrote:

            Hi,

            On 2026/05/29 4:37, Branko Čibej wrote:

            This intermittent failure shows up sometimes on both Linux and 
macOS:

            FAIL:  ra-test: Unknown test failure (-11); see tests.log.

            That's a crash, 11 is SIGSEGV on both OSes. Sometimes the same 
happens in repos-test, too, and it doesn't matter if its local or DAV or 
svnserve. I suspect that we have a wonderful little bug in our code. I've been 
trying, on and off, to trace this down either with a debugger or with clang's 
address sanitiser or with APR's pool debugging enabled, but it's an elusive 
little bxtrd and I've not yet been able to track it down. Not even so far as to 
figure out whether it's hiding in FSFS or somewhere else.

            It's been a few years and I think I remember seeing it on the 1.14 
branch, too. On the other hand, I don't recall any bug reports that could be 
linked to this observation. We're seeing this failure in the CI tests, too, at 
least with autotools. I haven't seen this with CMake, but failures in those 
workflows are far more often related to vcpkg or other environmental stuff, so 
it's a bit hard to find.

            If anyone has any idea where to look without diving into a 
line-by-line review of the code, please share your thoughts. This is starting 
to get on my nerves, just a little bit.

            The issue occurs in serf_default_destroy_and_data() via 
apr_pool_cleanup_for_exec() from the child process after apr_proc_create() to 
open a tunnel with multi-threaded. I assume that libsvn_ra_serf and/or serf 
have something wrong. Also, I guess that the same issue might occur when a hook 
is invoked with SVNMasterURI enabled on Apache mpm event or worker.


            It happens without DAV, too, and with prefork mpm – the
            only one that even works on macOS, so I always use that
            for testing.



            This is what I caught on macOS. It was davautocheck, but
            Serf is not involved:

            (lldb) target create 
"/Volumes/svn-test/trunk/subversion/tests/libsvn_ra/.libs/ra-test" --core 
"/cores/core.9605"
            Core file '/cores/core.9605' (arm64) was loaded.
            (lldb) bt
            * thread #2, stop reason = ESR_EC_DABORT_EL0 (fault address: 0x0)
               * frame #0: 0x0000000104a83524 
ra-test`close_tunnel(tunnel_context=0x000000072d08e6b0, 
tunnel_baton=0x000000072d03c0a0) at ra-test.c:329:7
                 frame #1: 0x0000000104d6b5f0 
libsvn_ra_svn-1.0.dylib`close_tunnel_cleanup(baton=0x000000072d08e258) at 
client.c:597:5
                 frame #2: 0x00000001051fd1b4 libapr-1.0.dylib`apr_pool_destroy 
+ 88
                 frame #3: 0x00000001051fd1a0 libapr-1.0.dylib`apr_pool_destroy 
+ 68
                 frame #4: 0x0000000104b353a0 
libsvn_ra-1.0.dylib`svn_ra_open5(session_p=0x000000016b5b2ec0, 
corrected_url_p=0x0000000000000000, redirect_url_p=0x0000000000000000, 
repos_URL="svn+test://localhost/test-repo-tunnel", uuid=0x0000000000000000, 
callbacks=0x000000072d03c0d8, callback_baton=0x0000000000000000, 
config=0x0000000000000000, pool=0x000000072d040028) at ra_loader.c:395:7
                 frame #5: 0x0000000104a7992c 
ra-test`tunnel_callback_test(opts=0x000000016b3826b8, pool=0x000000072d03c028) 
at ra-test.c:469:3
                 frame #6: 0x0000000104aba8a0 
libsvn_test-1.0.dylib`test_thread(thread=0x000000072d01c150, 
data=0x000000016b382570) at svn_test_main.c:577:15
                 frame #7: 0x0000000187549c58 
libsystem_pthread.dylib`_pthread_start + 136


            Note that this test runs against svnserve with just
            'check' too, no DAV involved at all. So it's either
            caused by something in libsvn_ra_svn, or more likely, in
            ra-test itself. Those tunnels are notoriously tricky, I
            remember we had trouble with them before in JavaHL, too.


            Finally got some interesting info from the core file – it
            takes some work to get core dumps on the Mac, setting
            ulimit -c isn't enough.

            frame #0: 0x0000000104a83524 
ra-test`close_tunnel(tunnel_context=0x000000072d08e6b0, 
tunnel_baton=0x000000072d03c0a0) at ra-test.c:329:7
                326             apr_proc_wait(b->proc, &child_exit_code, 
&child_exit_why, APR_WAIT);
                327     
                328           SVN_TEST_ASSERT_NO_RETURN(child_exit_status == 
APR_CHILD_DONE);
            -> 329         SVN_TEST_ASSERT_NO_RETURN(child_exit_code == 0);
                330           SVN_TEST_ASSERT_NO_RETURN(child_exit_why == 
APR_PROC_EXIT);
                331         }
                332     }
            (lldb) p child_exit_status
            (apr_status_t) 70005
            (lldb) p child_exit_code
            (int) 10
            (lldb) p child_exit_why
            (apr_exit_why_e) APR_PROC_SIGNAL | APR_PROC_SIGNAL_CORE


            That means svnserve crashed with a SIGBUS. I'll keep
            investigating.


            The SIGBUS is real, but it's not in svnserve – it's in the
            forked ra-test and happens before the exec() of svnserve.
            I had to build apr and apr-util with debugging enabled to
            find this, and here's the relevant part of the stack. For
            context: On Unix, apr_proc_create() forks the parent then
            runs cleanups on the global pool before exec()'ing the
            real child.


            (lldb) down
            frame #3: 0x0000000101114b10 
libapr-1.0.dylib`cleanup_pool_for_exec(p=0x000000087ac8e028) at 
apr_pools.c:2702:9
                2699        run_child_cleanups(&p->cleanups);
                2700    
                2701        for (p = p->child; p; p = p->sibling)
            -> 2702          cleanup_pool_for_exec(p);
                2703    }
                2704    
                2705    APR_DECLARE(void) apr_pool_cleanup_for_exec(void)
            (lldb) p *p
            (apr_pool_t) {
               parent = 0x000000087ac44028
               child = 0x000000087a844028
               sibling = 0x000000087ac94028
               ref = 0x000000087ac44030
               cleanups = NULL
               free_cleanups = 0x000000087ac8e7e8
               allocator = 0x0000000101500b40
               subprocesses = NULL
               abort_fn = 0x0000000101018d94 
(libsvn_subr-1.0.dylib`abort_on_pool_failure at pool.c:53)
               user_data = NULL
               tag = 0x0000000000000000
               active = 0x000000087a800000
               self = 0x000000087ac8e000
               self_first_avail = 0x000000087ac8e0a0 "file"
               pre_cleanups = NULL
            }
            (lldb) down
            frame #2: 0x0000000101114b10 
libapr-1.0.dylib`cleanup_pool_for_exec(p=0x000000087a844028) at 
apr_pools.c:2702:9
                2699        run_child_cleanups(&p->cleanups);
                2700    
                2701        for (p = p->child; p; p = p->sibling)
            -> 2702          cleanup_pool_for_exec(p);
                2703    }
                2704    
                2705    APR_DECLARE(void) apr_pool_cleanup_for_exec(void)
            (lldb) p *p
            (apr_pool_t) {
               parent = 0x0000000000000a35
               child = 0x0000000000000011
               sibling = 0x0000000000000002
               ref = 0x0000000060232b75
               cleanups = 0x0000000000000001
               free_cleanups = 0x0000000000000003
               allocator = 0x0000000000000011
               subprocesses = 0x0000000000000059
               abort_fn = 0x0000000000000005
               user_data = 0x00000000403dbe48
               tag = 0x0000000000000001 ""
               active = 0x0000000000000002
               self = 0x000000000000006a
               self_first_avail = 0x0000000000000001 ""
               pre_cleanups = 0x0000000000000006
            }


            Note he pool structure from frame #2: it's a clear case of
            memory corruption. Most of the pointers there are invalid,
            some are also misaligned. Calling
            run_child_cleanups(&p->cleanups) triggers an immediate
            SIGBUS because cleanups is misaligned. The fun part now
            will be trying to find out if this is a bug in APR pools –
            unlikely – or a memory corruption bug in ra-test or
            libsvn_ra – also unlikely but it has to be one of these.

            What finally pointed me in the right direction is that for
            every crash of ra-test, /two/ core files were created. One
            from the parent which aborted because the child's exit
            code was non-zero, and one from the child. Well, at least
            it won't be boring.


        Great detective work! Have we seen this on anything other than
        macOS (sorry I can't remember the details)?

        It is a bit over my head, but shout out if there is anything I
        can do to help.


    We see intermittent failures in the Linux CI, but I don't know if
    it's the same issue or something else.

    I could only reproduce this when running tests in parallel.

I should expand on this a bit. ra-test limits the number of concurrentthreads to 4, and I saw far less than 1% of test runs fail. But I have16 cores on this laptop and when I increased the thread limit to 16,the failure rate was closer to 30%. Definitely something related topools and concurrency.

I've looked carefully through the APR pool code and from what I couldsee it's thread safe as long as the allocator has a mutex. Our C testdriver however creates a separate allocator for each thread that doesnot have a mutex. This should be safe because access to that allocatorand its pool is single-threaded.

But after fork(), the child process runs all child cleanups from theglobal pool downwards. That access is single threaded too, but it'spossible that some interaction between the global pool's mutex andfork() leaves part of the pool hierarchy only partially initialised.That would be ... unexpected to say the least, but it's the onlyhypothesis I have for now.

I'll try removing the "optimisation" that creates per-threadallocators without mutexes, I think it's uncalled for and these aretests anyway. We won't have all that much contention on that mutex –it's only used when pools are created and destroyed, not when they'recleared or allocated from.

It turns out that this changes nothing at all. I'm starting to suspectthat it is an unintended interaction between APR pools andapr_proc_create()/fork() – so essentially a bug in APR.


-- Brane

Re: Failing ra-test

Reply via email to