On Fri, Oct 17, 2025 at 10:57 PM Maria Matejka <[email protected]> wrote:
>
> Hello Ze Xia,
>
> this looks like a real bug, yet I'm not sure whether we happen to observe it 
> in real world often. Please, do you have any instructions how to trigger it 
> reliably so that we can add it to our CI?
>
> Thanks,
> Maria
>

I tried to trigger it "naturally" by creating two bird daemons connected
through a veth pair, but this failed to reproduce the bug. According to
strace, connect() always returns -1 with errno = EINPROGRESS.

However, I found that I can make connect() wait a little while for the
connection to complete, so that it returns success, by preloading a
custom shared library that wraps it. My current implementation:

#define _GNU_SOURCE // glibc only exposes RTLD_NEXT in <dlfcn.h> with this
#include <dlfcn.h>
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>

// milliseconds
#define MAX_CONNECT_BLOCKTIME 10

typedef int (*connect_t)(int, const struct sockaddr *, socklen_t);

__attribute__((visibility("default")))
int connect(int sock, const struct sockaddr *addr, socklen_t len)
{
    int orig_errno = errno;

    connect_t true_connect = dlsym(RTLD_NEXT, "connect");
    int r = true_connect(sock, addr, len);
    if (!(addr->sa_family == AF_INET && r == -1 && errno == EINPROGRESS))
        return r;

    struct pollfd fds[1] = {{.fd = sock,
                             .events = POLLOUT | POLLERR | POLLHUP}};
    int poll_res = poll(fds, 1, MAX_CONNECT_BLOCKTIME);
    if (poll_res == 0)
    {
        errno = EINPROGRESS;
        return -1;
    }
    int err = 0;
    socklen_t errlen = sizeof(err);
    // if getsockopt() itself fails, errno is already set by it
    if (getsockopt(sock, SOL_SOCKET, SO_ERROR, &err, &errlen) == -1)
        return -1;
    if (err == 0)
    {
        errno = orig_errno;
        return 0;
    }
    else
    {
        errno = err;
        return -1;
    }
}

Compile it with:

gcc bird-preload.c -fPIC -fvisibility=hidden -shared -o libpreload.so

Then write the absolute path of libpreload.so to /etc/ld.so.preload
(see man ld.so for more information about LD_PRELOAD). I started two
bird daemons inside a docker container, using the config files attached
to this mail, and connected them with a veth pair. With libpreload.so
preloaded, the connect retry timer (2s) should fire every time and
tear down the connection, causing a reconnection that can be seen in
the debug log.

With libpreload.so loaded, bird should behave as if the thread were not
scheduled for a short while (<10ms) during the connect() call. As far
as I can tell, it has no other side effects. I'm not sure whether this
fits into your CI workflow, though. Hope this helps!

Regards,
Ze Xia

Attachment: node_1.conf
Description: Binary data

Attachment: node_2.conf
Description: Binary data
