Bug#1035983: libsoup3 (and libsoup2) autopkgtests are flaky: Address already in use: AH00072: make_sock: could not bind to address 127.0.0.1:47524

Simon McVittie Sat, 12 Jul 2025 10:13:24 -0700

Control: clone 1035983 -2
Control: retitle 1035983 libsoup3: intermittent test failures: Address already in use: AH00072: make_sock: could not bind to address 127.0.0.1:xxx
Control: retitle -2 libsoup3: [metabug] several intermittent test failures resulting in flaky autopkgtests and FTBFS
Control: unblock 1035983 by 1109107 1109108
Control: block -2 by 1035983


On Mon, 19 May 2025 at 17:57:50 +0200, Santiago Vila wrote:
> On 19/5/25 at 16:43, Simon McVittie wrote:
> > Is this still the same failure mode described in the bug title, with
> > "Address already in use" and "could not bind to address ..." being
> > reported by Apache?
>
> That's a very good question and I'm glad that you asked :-)
>
> In some cases, yes, but not always.

Bug #1035983 has always mentioned the AH00072 issue in its title, so I
think it's probably best if we consider any other sources of FTBFS or
autopkgtest failures as out-of-scope for #1035983.


Regarding the topic of flaky tests in general:

Unfortunately I suspect that what's happening here is that we have a
series of different test failures, each of them individually quite rare
(therefore hard to reproduce or debug), which add up to a significant
probability that at least one of the rare failures will happen at least
once in any given test run and therefore the overall test suite fails.
I've cloned a "metabug" (-2 above) to be blocked by #1035983 and other
concrete and potentially actionable causes of test failures, but that
metabug is not going to be directly actionable, because issues that
can't be identified can't be fixed: the only way it can be solved is to
chip away at its actionable dependencies until the failure rate becomes
sufficiently low. I am not an expert on this package and I cannot commit
to being able to achieve that.
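
As a rough illustration of how individually rare failures add up, here
is a minimal sketch. It assumes the failure modes are independent of
each other, which may well not be true for timing-dependent failures,
and the function name and the rates in the note below are made up for
illustration rather than measured from any build log.

```c
#include <stddef.h>

/* Illustrative only: probability that at least one of n failure modes
 * occurs in a single test run, assuming the modes are independent.
 * p[i] is the (hypothetical) per-run probability of failure mode i. */
double
combined_failure_probability (const double *p, size_t n)
{
	double all_pass = 1.0;
	size_t i;

	/* The run passes only if every failure mode stays quiet. */
	for (i = 0; i < n; i++)
		all_pass *= 1.0 - p[i];

	return 1.0 - all_pass;
}
```

For example, four hypothetical failure modes at 2% each would give an
overall failure rate of about 7.8% per run, which is why fixing any one
of them only lowers the total a little.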

Individual tests that are sufficiently flaky can be worked around by
disabling or ignoring the test if necessary (as was done for the
tls_interaction test already), but the cost of disabling tests is that
we can no longer use them to detect RC-severity regressions
(particularly on architectures with few users where the buildds and
autopkgtest are basically the only tools we have), so there's a
trade-off here between breakage caused by false-positive failures and
breakage caused by regressions that could have been caught by running
the tests. As a non-expert trying to keep this package afloat, I don't
feel that I am able to make high-quality uploads without automated tests
to detect my inevitable mistakes. I'm sorry that this is disappointing,
and I would be delighted to stop contributing to libsoup when someone
can do a better job, but until then all I can do is to try to have a
net-positive impact to the best of my limited ability.

As mentioned previously, the AH00072 issue, #1035983, is particularly bad
for this because it affects several tests equally, and disabling all of
them would lose a lot of the overall test coverage.

> I've put a collection of failed build logs here:
>
> https://people.debian.org/~sanvila/build-logs/libsoup3/

Thanks, hopefully someone can analyze those at some point and pick out
the actionable equivalence classes. I cannot commit to being able to do
this myself.

I've reported some other sources of intermittent test failures as
#1109107 (no solution known, help welcome), #1109108 (no solution known,
help welcome) and #1109120 (fixed in the latest upload to unstable by an
upstream change). None of these are, individually, a high probability of
failure, but they add up.

When I tried running the test suite repeatedly on barriere, the failure
modes I saw intermittently were #1109107 and #1109108. I don't think I
saw #1109120 or #1035983, so those might be less common, at least on
that particular machine (if the failures are timing-dependent then they
might behave differently elsewhere).


Regarding #1035983 (the AH00072 issue) specifically:

Last time I looked at the libsoup* test suite, the actual tests were
each reasonably reliable, but the reliability issue was with their
setup/teardown. They run a temporary Apache web server, in order to
have a realistic server to test against. I think what's happening is
that sometimes, the web server port from one test (let's say test
number 5) is still considered by the kernel to be in use by the time
we reach the setup stage of the next test (let's say test number 6).
As a result, the Apache for test number 6 can't listen on the port it
has been configured to use, and testing fails at that point.

I tried applying the attached patch as a brute-force attempt to solve
the port-still-in-use problem (#1035983). (FYI this will not apply
cleanly to upstream code, it requires other changes already in
debian/patches to add more debug info, which I added last time I spent
time on trying to figure this out.)

Unfortunately it didn't work: the test made multiple attempts to start
Apache, but they all failed with the same error message shown in the
Subject, until the overall test timed out. That suggests that my theory
about the web server port being in TIME_WAIT state might not have been
correct. I don't know what else to try there.
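
One possible way to narrow this down would be to probe the port directly
when Apache fails to start. The following is only a sketch and is not
part of the libsoup test suite: probe_loopback_port() is a hypothetical
helper name, and the interpretation assumes Linux semantics for
SO_REUSEADDR as I understand them. If the bind fails without
SO_REUSEADDR but succeeds with it, that would be consistent with a
lingering TIME_WAIT connection. If it fails both ways, it seems more
likely that some process still has a socket listening on the port.

```c
#include <errno.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical diagnostic helper: try to bind a TCP socket to
 * 127.0.0.1:port, optionally with SO_REUSEADDR set first.
 * Returns 0 if the bind succeeded, or the errno value (for example
 * EADDRINUSE) if something failed. The socket is always closed again,
 * so this only probes the port and never keeps it. */
int
probe_loopback_port (uint16_t port, int reuse_addr)
{
	struct sockaddr_in addr;
	int fd;
	int err = 0;

	fd = socket (AF_INET, SOCK_STREAM, 0);
	if (fd < 0)
		return errno;

	if (reuse_addr) {
		int one = 1;

		if (setsockopt (fd, SOL_SOCKET, SO_REUSEADDR,
				&one, sizeof one) < 0) {
			err = errno;
			close (fd);
			return err;
		}
	}

	memset (&addr, 0, sizeof addr);
	addr.sin_family = AF_INET;
	addr.sin_port = htons (port);
	addr.sin_addr.s_addr = htonl (INADDR_LOOPBACK);

	if (bind (fd, (struct sockaddr *) &addr, sizeof addr) < 0)
		err = errno;

	close (fd);
	return err;
}
```

Calling this once with reuse_addr set to 0 and once with it set to 1,
and logging both results next to the "Could not start Apache" message,
might show which of the two situations is happening.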

In 3.6.5-2 I added a patch fixing an upstream issue where one of the
tests that used Apache was not marked "don't run in parallel", so it
could end up being run in parallel with other tests - that could have
resulted in a similar failure mode. We can see whether that helps. I
think I've still seen the AH00072 error occasionally even after making
that change, though, so it can't be the whole story.


    smcv

From: Simon McVittie <[email protected]>
Date: Fri, 11 Jul 2025 13:27:41 +0100
Subject: tests: If we can't start Apache, wait a bit and try again

Maybe helps: #1035983
---
 tests/test-utils.c | 23 +++++++++++++++++++----
 1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/tests/test-utils.c b/tests/test-utils.c
index 37cd00b..0de446c 100644
--- a/tests/test-utils.c
+++ b/tests/test-utils.c
@@ -234,9 +234,13 @@ apache_cmd (const char *cmd)
 	return ok;
 }
 
+static const unsigned int MAX_START_APACHE_TRIES = 10;
+
 void
 apache_init (void)
 {
+	unsigned int i = 0;
+
 	g_test_message ("[%f] enter %s", g_get_monotonic_time () / 1e6, G_STRFUNC);
 
 	/* Set this environment variable if you are already running a
@@ -246,11 +250,22 @@ apache_init (void)
 
 	server_root = soup_test_build_filename_abs (G_TEST_BUILT, "", NULL);
 
-	if (!apache_cmd ("start")) {
-		g_printerr ("Could not start apache\n");
-		exit (1);
+	while (TRUE) {
+		if (apache_cmd ("start")) {
+			apache_running = TRUE;
+			goto out;
+		} else {
+			g_test_message ("[%f] Could not start Apache", g_get_monotonic_time () / 1e6);
+		}
+
+		if (++i > MAX_START_APACHE_TRIES) {
+			g_printerr ("Could not start apache\n");
+			exit (1);
+		} else {
+			g_test_message ("Will wait a bit and try again");
+			g_usleep (10 * G_USEC_PER_SEC);
+		}
 	}
-	apache_running = TRUE;
 
 out:
 	g_test_message ("[%f] leave %s", g_get_monotonic_time () / 1e6, G_STRFUNC);
