Hello hackers,

Looking at a recent failure on the buildfarm:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=morepork&dt=2024-04-30%2020%3A48%3A34

# poll_query_until timed out executing this query:
# SELECT archived_count FROM pg_stat_archiver
# expecting this output:
# 1
# last actual query output:
# 0
# with stderr:
# Looks like your test exited with 29 just after 4.
[23:01:41] t/020_archive_status.pl ..............
Dubious, test returned 29 (wstat 7424, 0x1d00)
Failed 12/16 subtests

with the following error in the log:
2024-04-30 22:57:27.931 CEST [83115:1] LOG:  archive command failed with exit code 1
2024-04-30 22:57:27.931 CEST [83115:2] DETAIL:  The failed archive command was: cp "pg_wal/000000010000000000000001_does_not_exist" "000000010000000000000001_does_not_exist"
...
2024-04-30 22:57:28.070 CEST [47962:2] [unknown] LOG:  connection authorized: user=pgbf database=postgres application_name=020_archive_status.pl
2024-04-30 22:57:28.072 CEST [47962:3] 020_archive_status.pl LOG: statement: SELECT archived_count FROM pg_stat_archiver
2024-04-30 22:57:28.073 CEST [83115:3] LOG:  could not send to statistics collector: Resource temporarily unavailable

and the corresponding code (on REL_13_STABLE):
static void
pgstat_send(void *msg, int len)
{
    int         rc;

    if (pgStatSock == PGINVALID_SOCKET)
        return;

    ((PgStat_MsgHdr *) msg)->m_size = len;

    /* We'll retry after EINTR, but ignore all other failures */
    do
    {
        rc = send(pgStatSock, msg, len, 0);
    } while (rc < 0 && errno == EINTR);

#ifdef USE_ASSERT_CHECKING
    /* In debug builds, log send failures ... */
    if (rc < 0)
        elog(LOG, "could not send to statistics collector: %m");
#endif
}

I wonder whether this retry should also be performed after EAGAIN (Resource
temporarily unavailable) and EWOULDBLOCK.
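
To illustrate the question, here is a minimal sketch of such a loop. It is not
a proposed patch: send_retrying() and SKETCH_MAX_EAGAIN_RETRIES are
hypothetical names that do not exist in the tree, and the bounded busy-retry
is only an assumption to keep the example short. A real fix would probably
need to wait for the socket to become writable rather than spin.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical cap on EAGAIN retries; not taken from PostgreSQL. */
#define SKETCH_MAX_EAGAIN_RETRIES 10

/*
 * Hypothetical helper modelled on the send() loop in pgstat_send():
 * always retry after EINTR, and retry a bounded number of times after
 * EAGAIN/EWOULDBLOCK.  Any other failure is returned to the caller.
 */
static ssize_t
send_retrying(int sock, const void *msg, size_t len)
{
    ssize_t     rc;
    int         again_retries = 0;

    for (;;)
    {
        rc = send(sock, msg, len, 0);
        if (rc >= 0)
            return rc;          /* message was handed to the kernel */

        if (errno == EINTR)
            continue;           /* interrupted, just try again */

        if ((errno == EAGAIN || errno == EWOULDBLOCK) &&
            again_retries++ < SKETCH_MAX_EAGAIN_RETRIES)
            continue;           /* socket buffer full, retry a few times */

        return rc;              /* give up, errno is left for the caller */
    }
}
```

With the cap in place the loop cannot spin forever if the receiver never
drains its socket, at the cost of still dropping the message in that case.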

With a simple send() wrapper (attached) activated via LD_PRELOAD, I could
easily reproduce this failure when running
`make -s check -C src/test/recovery/ PROVE_TESTS="t/020*"` on
REL_13_STABLE:
t/020_archive_status.pl .. 1/16 # poll_query_until timed out executing this query:
# SELECT archived_count FROM pg_stat_archiver
# expecting this output:
# 1
# last actual query output:
# 0
# with stderr:
# Looks like your test exited with 29 just after 4.
t/020_archive_status.pl .. Dubious, test returned 29 (wstat 7424, 0x1d00)
Failed 12/16 subtests

I also reproduced another failure (which unfortunately lacks useful
diagnostics):
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=morepork&dt=2022-11-10%2015%3A30%3A16
...
t/020_archive_status.pl .. 8/16 # poll_query_until timed out executing this query:
# SELECT last_archived_wal FROM pg_stat_archiver
# expecting this output:
# 000000010000000000000002
# last actual query output:
# 000000010000000000000001
# with stderr:
# Looks like your test exited with 29 just after 13.
t/020_archive_status.pl .. Dubious, test returned 29 (wstat 7424, 0x1d00)
Failed 3/16 subtests
...

The "n == 64" condition in the cranky send() is needed to target exactly
these failures. Without this restriction, the test (and also `make check`)
just hangs because of:
            if (errno == EINTR)
                continue;       /* Ok if we were interrupted */

            /*
             * Ok if no data writable without blocking, and the socket is in
             * non-blocking mode.
             */
            if (errno == EAGAIN ||
                errno == EWOULDBLOCK)
            {
                return 0;
            }
in internal_flush_buffer().

On the other hand, even with:
ssize_t
send(int s, const void *buf, size_t n, int flags)
{
    if (rand() % 10000 == 0)
    {
        errno = EINTR;
        return -1;
    }
    return real_send(s, buf, n, flags);
}

`make check` fails with many miscellaneous errors...

Best regards,
Alexander
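
/* Cranky send() wrapper, to be loaded with LD_PRELOAD. */
#define _GNU_SOURCE		/* for RTLD_NEXT */
#include <stdlib.h>
#include <sys/types.h>
#include <stddef.h>
#include <dlfcn.h>
#include <errno.h>

/* Pointer to the real send(), resolved when the library is loaded. */
static ssize_t (*real_send)(int s, const void *buf, size_t n, int flags) = NULL;

__attribute__((constructor))
void
lib_init(void)
{
	/* Look up the next definition of send() after this wrapper. */
	real_send = dlsym(RTLD_NEXT, "send");
}

/* Return type matches the POSIX prototype of send(). */
ssize_t
send(int s, const void *buf, size_t n, int flags)
{
	/* Fail roughly every tenth 64-byte send with EAGAIN. */
	if (n == 64 && rand() % 10 == 0)
	{
		errno = EAGAIN;
		return -1;
	}
	return real_send(s, buf, n, flags);
}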
