Hi,

Our config is quite complex and I'm trying to narrow it down. The crash occurs only on one production haproxy cluster (6 servers in each of two data centers) under significant load. Crashes occur on random servers, so I would exclude memory corruption.
I suspect SPOE and/or the Lua script; both are used to send metadata about each request to an external endpoint. Yesterday I disabled this feature in one datacenter to verify.

Our build is done in docker (Ubuntu bionic) with kernel 4.9.184-linuxkit, the crash is on Ubuntu bionic 4.15.0-55-generic, using:

haproxy 2.0.17
openssl 1.1.1f
pcre 8.44
lua 5.3.5
lrandom (PRNG for lua, we've been using it for 2 or 3 years without any problems, and soon we will drop it from our build)

compiled in the following way:

# LUA
wget http://www.lua.org/ftp/lua-$LUA_VERSION.tar.gz \
    && tar -zxf lua-$LUA_VERSION.tar.gz \
    && cd lua-$LUA_VERSION \
    && make linux test \
    && make install

# LUA LRANDOM
wget http://webserver2.tecgraf.puc-rio.br/~lhf/ftp/lua/ar/lrandom-100.tar.gz \
    && tar -zxf lrandom-100.tar.gz \
    && make -C lrandom-100 \
    && make -C lrandom-100 install

# PCRE
wget https://ftp.pcre.org/pub/pcre/pcre-$PCRE_VERSION.tar.gz \
    && tar -zxf pcre-$PCRE_VERSION.tar.gz \
    && cd pcre-$PCRE_VERSION \
    && ./configure --prefix=/usr/lib/haproxy/pcre_$PCRE_VERSION --enable-jit --enable-utf --enable-unicode-properties --disable-silent-rules \
    && make \
    && make install

# OPENSSL
wget https://www.openssl.org/source/openssl-$SSL_VERSION.tar.gz \
    && tar -zxf openssl-$SSL_VERSION.tar.gz \
    && cd openssl-$SSL_VERSION \
    && ./Configure --openssldir=/usr/lib/haproxy/openssl_$SSL_VERSION --prefix=/usr/lib/haproxy/openssl_$SSL_VERSION -Wl,-rpath=/usr/lib/haproxy/openssl_$SSL_VERSION/lib shared no-idea linux-x86_64 \
    && make depend \
    && make \
    && make install_sw

and finally haproxy is compiled using the deb builder:

override_dh_auto_build:
	make TARGET=$(HAP_TARGET) \
	    DEFINE="-DIP_BIND_ADDRESS_NO_PORT=24 -DMAX_SESS_STKCTR=12" \
	    USE_PCRE=1 USE_PCRE_JIT=1 \
	    PCRE_INC=/usr/lib/haproxy/pcre_$(PCRE_VERSION)/include \
	    PCRE_LIB="/usr/lib/haproxy/pcre_$(PCRE_VERSION)/lib -Wl,-rpath,/usr/lib/haproxy/pcre_$(PCRE_VERSION)/lib" \
	    USE_GETADDRINFO=1 USE_OPENSSL=1 \
	    SSL_INC=/usr/lib/haproxy/openssl_$(SSL_VERSION)/include \
	    SSL_LIB="/usr/lib/haproxy/openssl_$(SSL_VERSION)/lib -Wl,-rpath,/usr/lib/haproxy/openssl_$(SSL_VERSION)/lib" \
	    ADDLIB=-ldl USE_ZLIB=1 USE_DL=1 USE_LUA=1 USE_REGPARM=1

-DIP_BIND_ADDRESS_NO_PORT is now obsolete and we'll drop it.
-DMAX_SESS_STKCTR=12 is there because we need more stick tables.

Kind regards,

On Thu, 17 Sep 2020 at 08:18, Willy Tarreau wrote:

> Hi guys,
>
> On Thu, Sep 17, 2020 at 11:05:31AM +1000, Igor Cicimov wrote:
> (...)
> > > Coredump fragment from thread1:
> > > (gdb) bt
> > > #0 0x000055cbbf6ed64b in h2s_notify_recv (h2s=0x7f65b8b55130) at
> > > src/mux_h2.c:783
>
> So the code is this one:
>
> 777 static void __maybe_unused h2s_notify_recv(struct h2s *h2s)
> 778 {
> 779         struct wait_event *sw;
> 780
> 781         if (h2s->recv_wait) {
> 782                 sw = h2s->recv_wait;
> 783                 sw->events &= ~SUB_RETRY_RECV;
> 784                 tasklet_wakeup(sw->tasklet);
> 785                 h2s->recv_wait = NULL;
> 786         }
> 787 }
>
> In the trace it's said that sw = 0xffffffff. Looking at all places where
> h2s->recv_wait is modified, it's either NULL or a valid pointer to some
> structure. We could have imagined that for whatever reason h2s is wrong
> here, but this call only happens when its state is still valid, and it
> experiences double dereferences before landing here, which tends to
> indicate that the h2s pointer is OK. Thus the only hypothesis I can have
> for now is memory corruption :-/ That field would get overwritten with
> (int)-1 for whatever reason, maybe a wrong cast somewhere, but it's not
> as if we had many of these.
>
> > I'm not one of the devs but obviously many of us using v2.0 will be
> > interested in the answer. Assuming you do not install from packages, can
> > you please provide some more background on how you produce the binary,
> > like if you compile then what OS and kernel is this compiled on and what
> > OS and kernel this crashes on? Also, are any other custom-compiled
> > packages in use, like OpenSSL, Lua etc., that you might be using or have
> > compiled haproxy against?
> >
> > Also if this is a bug and you have hit some corner case with your config
> > (many are using 2.0 but we have not seen crashes) you should provide a
> > stripped down version (not too stripped though, just the sensitive data)
> > of your config too.
>
> I agree with Igor here, any info to try to narrow down a reproducer, both
> in terms of config and operations, would be tremendously helpful!
>
> Thanks,
> Willy
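
For reference, a minimal sketch of how the value sw = 0xffffffff mentioned in the quoted analysis could come from an (int)-1 ending up in a pointer field. This is not haproxy code: struct fake_h2s and the three helper functions are hypothetical names made up for illustration, and the sketch assumes a little-endian 64-bit platform such as x86_64. It only shows that a 32-bit store over a NULL pointer, or a cast that goes through an unsigned 32-bit type, both give 0xffffffff, while a sign-extending cast would give all ones instead.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for struct h2s; only the field under discussion. */
struct fake_h2s {
    void *recv_wait; /* expected to be NULL or a valid pointer */
};

/* Case 1: a stray 4-byte store of (int)-1 lands on the pointer field.
 * On little-endian 64-bit this overwrites the low 32 bits only, so a
 * field that was NULL before becomes 0x00000000ffffffff. */
static uintptr_t stray_int_store(struct fake_h2s *s)
{
    int v = -1;

    memcpy(&s->recv_wait, &v, sizeof(v));
    return (uintptr_t)s->recv_wait;
}

/* Case 2: -1 passes through an unsigned 32-bit type before being cast
 * to a pointer, so it is zero-extended to 0x00000000ffffffff. */
static uintptr_t zero_extended_cast(void)
{
    unsigned int v = (unsigned int)-1;
    void *p = (void *)(uintptr_t)v;

    return (uintptr_t)p;
}

/* Case 3, for contrast: a sign-extending cast sets all 64 bits, which
 * would not match the 0xffffffff seen in the trace. */
static uintptr_t sign_extended_cast(void)
{
    int v = -1;
    void *p = (void *)(intptr_t)v;

    return (uintptr_t)p;
}
```

If the first case applied, the field would have had to be NULL (or at least have zero upper bits) before the stray write, since a 4-byte store leaves the upper half of a valid 64-bit pointer untouched.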

