Hi again Olivier,
(I'm re-adding the mailing list, which was dropped in your reply.)
On 01/08/2016 at 22:48, Olivier Doucet wrote:
Hello Cyril,
2016-08-01 22:36 GMT+02:00 Cyril Bonté <[email protected]>:
Hi Olivier,
On 29/07/2016 at 17:25, Olivier Doucet wrote:
hello,
I'm trying to reload haproxy, but the new process fails to start with a
segmentation fault. Fortunately, the old process is still there, so no
downtime occurs (just that the config is not reloaded).
I'm using version 1.6.5 (I did not update because of the SSL bug, and as
I'm using CentOS + DH keys, I wanted to wait for the fix in 1.6.7). I'll
try to reproduce it with 1.6.7 on Monday (no updates on Friday, that's
the rule!).
I think this is related to the usage of server-state-file: I have at
least one server DOWN in this file. Emptying the server-state file and
reloading HAProxy makes it work correctly.
Unfortunately, the config file is very large and cannot be posted here.
I couldn't spend a lot of time on this issue and failed to find a
configuration file that triggers the segfault.
Me too. I tried to reproduce it today with a smaller config file, with
no success. But looking at the code, I tend to think that your
configuration uses "slowstart" on some servers, and that's where it
crashes.
I confirm that I have the following line in the defaults section:
default-server maxconn 2000 fall 3 rise 1 inter 2500ms fastinter 1000ms downinter 5000ms slowstart 30s
Can you check your configuration and verify that you don't have
duplicate server names in the same backend? (Or can you send me your
state file, in private this time ;-).)
For example, this configuration will segfault:

global
    stats socket /tmp/haproxy.sock level admin
    server-state-file ./haproxy.state

defaults
    load-server-state-from-file global
    default-server maxconn 2000 fall 3 rise 1 inter 2500ms fastinter 1000ms downinter 5000ms slowstart 30s

backend test0
    server s0 127.0.0.1:81 check # DOWN
    server s0 127.0.0.1:80 check # UP
And the state file contains (headers removed):
2 test0 1 s0 127.0.0.1 0 0 1 1 5 8 2 0 6 0 0 0
2 test0 2 s0 127.0.0.1 2 0 1 1 5 6 3 3 6 0 0 0
First, haproxy will set the server DOWN, then it will find a second line
setting it back UP, which makes srv_set_running() enter the code
producing the segfault.
For the new bug you found, that's great.
Tomorrow I'll investigate more to find a way to reproduce my bug.
Olivier
In srv_set_running() [server.c], we have:

if (s->slowstart > 0) {
    task_schedule(s->warmup, tick_add(now_ms, MS_TO_TICKS(MAX(1000, s->slowstart / 20))));
}

but it looks like s->warmup is NULL. Maybe it should be initialized as
it is done in the start_checks() function (in checks.c).
Unfortunately, trying to find a configuration file that could
reproduce the issue, I found other crash cases :-(
Steps to reproduce:
1. Use the following configuration file:

global
    stats socket /tmp/haproxy.sock level admin
    server-state-file ./haproxy.state

defaults
    load-server-state-from-file global

backend test0
    server s0 127.0.0.1:80

2. Run haproxy with it.
3. Save the server state file:
$ echo "show servers state" | socat stdio /tmp/haproxy.sock > ./haproxy.state
4. Stop haproxy and add the keyword "disabled" on the backend "test0":

global
    stats socket /tmp/haproxy.sock level admin
    server-state-file ./haproxy.state

defaults
    load-server-state-from-file global

backend test0
    disabled
    server s0 127.0.0.1:80
5. Start haproxy.
=> It will die with SIGFPE.
This happens because a disabled proxy is not initialized and the
variables used to divide weights are set to 0:

0x000000000043c53a in server_recalc_eweight (sv=sv@entry=0x7b7bd0) at src/server.c:750
750         sv->eweight = (sv->uweight * w + px->lbprm.wmult - 1) / px->lbprm.wmult;
(gdb) bt
#0  0x000000000043c53a in server_recalc_eweight (sv=sv@entry=0x7b7bd0) at src/server.c:750
#1  0x000000000043fd97 in srv_update_state (params=0x7fffffffbbd0, version=1, srv=0x7b7bd0) at src/server.c:2141
#2  apply_server_state () at src/server.c:2473
#3  0x0000000000416b0e in init (argc=<optimized out>, argv=<optimized out>, argv@entry=0x7fffffffe128) at src/haproxy.c:839
#4  0x0000000000414989 in main (argc=<optimized out>, argv=0x7fffffffe128) at src/haproxy.c:1635
Main information:
nbcores 7
server-state-file /tmp/haproxy_server_state
HAProxy -vv output:
HA-Proxy version 1.6.5 2016/05/10
Copyright 2000-2016 Willy Tarreau <[email protected]>
Build options :
TARGET = linux2628
CPU = native
CC = gcc
CFLAGS  = -O2 -march=native -g -fno-strict-aliasing -Wdeclaration-after-statement
OPTIONS = USE_OPENSSL=1 USE_PCRE=1 USE_TFO=1
Default settings :
maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200
Encrypted password support via crypt(3): yes
Built without compression support (neither USE_ZLIB nor USE_SLZ are set)
Compression algorithms supported : identity("identity")
Built with OpenSSL version : OpenSSL 1.0.2h 3 May 2016
Running on OpenSSL version : OpenSSL 1.0.2h 3 May 2016
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports prefer-server-ciphers : yes
Built with PCRE version : 7.2 2007-06-19
PCRE library supports JIT : no (USE_PCRE_JIT not set)
Built without Lua support
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Available polling systems :
epoll : pref=300, test result OK
poll : pref=200, test result OK
select : pref=150, test result OK
Total: 3 (3 usable), will use epoll.
As I am using nbcores > 1, the server state file is built like this:

> /tmp/haproxy_server_state
for i in $(/bin/ls /var/run/haproxy-*.sock); do
    socat $i - <<< "show servers state" >> /tmp/haproxy_server_state
done
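As a side note, a slightly more defensive variant of that loop (using a plain glob instead of parsing ls output; the socket glob and output path are the ones assumed above, adjust them to your setup) could be:

```shell
#!/bin/sh
# Rebuild the aggregated state file from every per-process admin socket.
STATE=/tmp/haproxy_server_state
: > "$STATE"                        # truncate before appending
for sock in /var/run/haproxy-*.sock; do
    [ -S "$sock" ] || continue      # skip if the glob did not match
    echo "show servers state" | socat stdio "$sock" >> "$STATE"
done
```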
gdb info:
Program terminated with signal 11, Segmentation fault.
#0 srv_set_running () at include/proto/task.h:244
#1 0x000000000042ee3a in apply_server_state () at src/server.c:2069
#2 0x0000000000405618 in init () at src/haproxy.c:839
#3 0x0000000000407049 in main () at src/haproxy.c:1635
bt full gives this:
#0  srv_set_running () at include/proto/task.h:244
        srv_keywords = {scope = 0x0, list = {n = 0x8b1258, p = 0x8b90e8}, kw = 0x8b12a0}
        srv_kws = {scope = 0x6074f2 "ALL", list = {n = 0x8b4d68, p = 0x8b12a8}, kw = 0x8b1250}
#1  0x000000000042ee3a in apply_server_state () at src/server.c:2069
        srv_keywords = {scope = 0x0, list = {n = 0x8b1258, p = 0x8b90e8}, kw = 0x8b12a0}
        srv_kws = {scope = 0x6074f2 "ALL", list = {n = 0x8b4d68, p = 0x8b12a8}, kw = 0x8b1250}
#2 0x0000000000405618 in init () at src/haproxy.c:839
[...]
trash = {
  str = 0x23b1910 "Server OBFUSCATED:80/OBFUSCATED is DOWN, changed from server-state after a reload. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue",
  size = 16384, len = 168}
[...]
I can provide the coredump in private, along with my haproxy binary, if
it helps.
--
Cyril Bonté