On Fri, Oct 17, 2025 at 10:57 PM Maria Matejka <[email protected]> wrote:
>
> Hello Ze Xia,
>
> this looks like a real bug, yet I'm not sure whether we happen to observe it
> in real world often. Please, do you have any instructions how to trigger it
> reliably so that we can add it to our CI?
>
> Thanks,
> Maria
>

I tried to trigger it "naturally" by running two BIRD daemons connected
through a veth pair, but that failed to reproduce the bug. According to
strace, connect() always returns -1 with errno = EINPROGRESS.

However, I found that I can make connect() wait a little while, so that
it can succeed before returning, by preloading a custom shared library
that wraps connect(). My current implementation:
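
/* glibc only exposes RTLD_NEXT when _GNU_SOURCE is defined */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>

// milliseconds
#define MAX_CONNECT_BLOCKTIME 10

typedef int (*connect_t)(int, const struct sockaddr *, socklen_t);

__attribute__((visibility("default"))) int
connect(int sock, const struct sockaddr *addr, socklen_t len)
{
  int orig_errno = errno;

  // Look up the real connect() in the next object in the search order
  connect_t true_connect = (connect_t) dlsym(RTLD_NEXT, "connect");
  if (!true_connect)
  {
    errno = ENOSYS;
    return -1;
  }

  int r = true_connect(sock, addr, len);

  // Only intercept IPv4 connects that are still in progress
  if (!(r == -1 && errno == EINPROGRESS && addr && addr->sa_family == AF_INET))
    return r;

  // Give the handshake up to MAX_CONNECT_BLOCKTIME ms to finish
  struct pollfd fds[1] = {{.fd = sock,
                           .events = POLLOUT | POLLERR | POLLHUP}};
  int poll_res = poll(fds, 1, MAX_CONNECT_BLOCKTIME);

  // Timeout or poll() failure (e.g. EINTR): the connect is still pending
  if (poll_res <= 0)
  {
    errno = EINPROGRESS;
    return -1;
  }

  // The socket is ready: fetch the final result of the connect
  int err = 0;
  socklen_t errlen = sizeof(err);
  if (getsockopt(sock, SOL_SOCKET, SO_ERROR, &err, &errlen) < 0)
    return -1; // errno was set by getsockopt()

  if (err == 0)
  {
    errno = orig_errno;
    return 0;
  }

  errno = err;
  return -1;
}
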
Compile it with:

gcc bird-preload.c -fPIC -fvisibility=hidden -shared -o libpreload.so

Then write the absolute path of libpreload.so to /etc/ld.so.preload
(see man ld.so for more information about LD_PRELOAD). I started two BIRD
daemons inside a Docker container, using the config files attached to
this mail, and connected them with a veth pair. When libpreload.so
is preloaded, the connect retry timer (2s) should fire every time and
tear down the connection, causing a reconnection, which can be
seen in the debug log.
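
Before involving BIRD at all, it may be useful to check that the preload is actually in effect. The following is only a sketch under assumptions: the file name probe.c, the function probe_nonblocking_connect() and the PROBE_STANDALONE macro are hypothetical names made up for this example and are not part of BIRD or of the library above. It opens a listener on 127.0.0.1 and reports what a single non-blocking connect() to it returns.

```c
/* probe.c: hypothetical helper, not part of BIRD or of libpreload.so */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Open a listener on 127.0.0.1 (kernel-chosen port) and issue one
 * non-blocking connect() to it. Returns 0 if connect() succeeded
 * immediately, otherwise the errno value that was observed
 * (EINPROGRESS if the handshake was still pending on return).
 */
int probe_nonblocking_connect(void)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    int result;

    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    if (lfd < 0)
        return errno;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0; /* let the kernel pick a free port */

    /* Bind, listen, then read back the port the kernel assigned. */
    if (bind(lfd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(lfd, 1) < 0 ||
        getsockname(lfd, (struct sockaddr *)&addr, &len) < 0) {
        result = errno;
        close(lfd);
        return result;
    }
    assert(len == sizeof(addr));

    int cfd = socket(AF_INET, SOCK_STREAM, 0);
    if (cfd < 0) {
        result = errno;
        close(lfd);
        return result;
    }

    /* Switch the client socket to non-blocking mode. */
    int flags = fcntl(cfd, F_GETFL, 0);
    if (flags < 0 || fcntl(cfd, F_SETFL, flags | O_NONBLOCK) < 0) {
        result = errno;
        close(cfd);
        close(lfd);
        return result;
    }

    /* Save errno before close() has a chance to overwrite it. */
    if (connect(cfd, (struct sockaddr *)&addr, sizeof(addr)) == 0)
        result = 0;
    else
        result = errno;

    close(cfd);
    close(lfd);
    return result;
}

#ifdef PROBE_STANDALONE
int main(void)
{
    int r = probe_nonblocking_connect();
    if (r == 0)
        printf("connect() succeeded immediately\n");
    else
        printf("connect() failed: %s\n", strerror(r));
    return 0;
}
#endif
```

It can be built with gcc -DPROBE_STANDALONE probe.c -o probe and run once as ./probe and once with LD_PRELOAD set to the absolute path of libpreload.so. Without the library, the probe would be expected to report EINPROGRESS on Linux, in line with the strace observation above. With it, the loopback handshake should normally complete within the 10 ms window, so the probe should report an immediate success.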

With libpreload.so loaded, BIRD should behave just as if its thread were
not scheduled for a short while (<10ms) during the connect() call; as far
as I can tell, it has no other side effects. I'm not sure whether this
fits into your CI workflow, though. Hope this helps!

Regards,
Ze Xia
Attachments: node_1.conf, node_2.conf
