Re: Weird issues with UNIX-Sockets on 2.1.x

2020-03-27 Thread Christian Ruppert

On 2020-03-27 16:49, Olivier Houchard wrote:

On Fri, Mar 27, 2020 at 04:32:21PM +0100, Christian Ruppert wrote:

On 2020-03-27 16:27, Olivier Houchard wrote:
> On Fri, Mar 27, 2020 at 04:21:20PM +0100, Christian Ruppert wrote:
>> During the reload I just found something in the daemon log:
>> Mar 27 13:37:54 somelb haproxy[20799]: [ALERT] 086/133748 (20799) :
>> Starting proxy someotherlistener: cannot bind socket [0.0.0.0:18540]
>> Mar 27 13:37:54 somelb haproxy[20799]: [ALERT] 086/133748 (20799) :
>> Starting proxy someotherlistener: cannot bind socket [:::18540]
>>
>> So during the reload, this happened and seems to have caused any
>> further
>> issues/trouble.
>>
>
> That would make sense. Does that mean you have old processes hanging
> around ? Do you use seemless reload ? If so, it shouldn't attempt to
> bind the socket, but get them from the old process.

I remember that it was necessary to have a systemd wrapper around, as 
it

caused trouble otherwise, due to PID being changed etc.
Not sure if that wrapper is still in use. In this case it's systemd
though.
[Unit]
Description=HAProxy Load Balancer
After=network.target

[Service]
Environment="CONFIG=/etc/haproxy/haproxy.cfg" 
"PIDFILE=/run/haproxy.pid"

ExecStartPre=/usr/sbin/haproxy -f $CONFIG -c -q
ExecStart=/usr/sbin/haproxy -Ws -f $CONFIG -p $PIDFILE
ExecReload=/usr/sbin/haproxy -f $CONFIG -c -q
ExecReload=/bin/kill -USR2 $MAINPID
KillMode=mixed
Restart=always
SuccessExitStatus=143
TimeoutStopSec=30
Type=notify


[...]


We've added the TimeoutStopSec=30 for some reason (I'd have to ask my
college, something took longer or something like that, since we have
quite a lot of frontends/listener/backend)
Only the two processes I mentioned before are / were running. Seems 
like

the fallback didn't work properly?



The wrapper is no longer needed, it has been superceeded by the
master-worker (which you seem to be using, given you're using -Ws).
It is possible the old process refuse to die, and you end up hitting 
the

timeout and it gets killed eventually, but it's too late.
Do you have a expose-fd listeners on the unix stats socket ? Using it
will allow the new process to connect to the old process' stats socket,
and get all the listening sockets, so that it won't have to bind them.



Oh, that sounds quite handy. I wasn't aware of it. I'll add it soonish. 
Thanks for the hint!



Regards,

Olivier


--
Regards,
Christian Ruppert



Re: Weird issues with UNIX-Sockets on 2.1.x

2020-03-27 Thread Olivier Houchard
On Fri, Mar 27, 2020 at 04:32:21PM +0100, Christian Ruppert wrote:
> On 2020-03-27 16:27, Olivier Houchard wrote:
> > On Fri, Mar 27, 2020 at 04:21:20PM +0100, Christian Ruppert wrote:
> >> During the reload I just found something in the daemon log:
> >> Mar 27 13:37:54 somelb haproxy[20799]: [ALERT] 086/133748 (20799) :
> >> Starting proxy someotherlistener: cannot bind socket [0.0.0.0:18540]
> >> Mar 27 13:37:54 somelb haproxy[20799]: [ALERT] 086/133748 (20799) :
> >> Starting proxy someotherlistener: cannot bind socket [:::18540]
> >> 
> >> So during the reload, this happened and seems to have caused any 
> >> further
> >> issues/trouble.
> >> 
> > 
> > That would make sense. Does that mean you have old processes hanging
> > around ? Do you use seemless reload ? If so, it shouldn't attempt to
> > bind the socket, but get them from the old process.
> 
> I remember that it was necessary to have a systemd wrapper around, as it 
> caused trouble otherwise, due to PID being changed etc.
> Not sure if that wrapper is still in use. In this case it's systemd 
> though.
> [Unit]
> Description=HAProxy Load Balancer
> After=network.target
> 
> [Service]
> Environment="CONFIG=/etc/haproxy/haproxy.cfg" "PIDFILE=/run/haproxy.pid"
> ExecStartPre=/usr/sbin/haproxy -f $CONFIG -c -q
> ExecStart=/usr/sbin/haproxy -Ws -f $CONFIG -p $PIDFILE
> ExecReload=/usr/sbin/haproxy -f $CONFIG -c -q
> ExecReload=/bin/kill -USR2 $MAINPID
> KillMode=mixed
> Restart=always
> SuccessExitStatus=143
> TimeoutStopSec=30
> Type=notify

[...]

> We've added the TimeoutStopSec=30 for some reason (I'd have to ask my 
> college, something took longer or something like that, since we have 
> quite a lot of frontends/listener/backend)
> Only the two processes I mentioned before are / were running. Seems like 
> the fallback didn't work properly?
> 

The wrapper is no longer needed, it has been superceeded by the
master-worker (which you seem to be using, given you're using -Ws).
It is possible the old process refuse to die, and you end up hitting the
timeout and it gets killed eventually, but it's too late.
Do you have a expose-fd listeners on the unix stats socket ? Using it
will allow the new process to connect to the old process' stats socket,
and get all the listening sockets, so that it won't have to bind them.

Regards,

Olivier



Re: Weird issues with UNIX-Sockets on 2.1.x

2020-03-27 Thread Christian Ruppert

On 2020-03-27 16:27, Olivier Houchard wrote:

On Fri, Mar 27, 2020 at 04:21:20PM +0100, Christian Ruppert wrote:

During the reload I just found something in the daemon log:
Mar 27 13:37:54 somelb haproxy[20799]: [ALERT] 086/133748 (20799) :
Starting proxy someotherlistener: cannot bind socket [0.0.0.0:18540]
Mar 27 13:37:54 somelb haproxy[20799]: [ALERT] 086/133748 (20799) :
Starting proxy someotherlistener: cannot bind socket [:::18540]

So during the reload, this happened and seems to have caused any 
further

issues/trouble.



That would make sense. Does that mean you have old processes hanging
around ? Do you use seemless reload ? If so, it shouldn't attempt to
bind the socket, but get them from the old process.


I remember that it was necessary to have a systemd wrapper around, as it 
caused trouble otherwise, due to PID being changed etc.
Not sure if that wrapper is still in use. In this case it's systemd 
though.

[Unit]
Description=HAProxy Load Balancer
After=network.target

[Service]
Environment="CONFIG=/etc/haproxy/haproxy.cfg" "PIDFILE=/run/haproxy.pid"
ExecStartPre=/usr/sbin/haproxy -f $CONFIG -c -q
ExecStart=/usr/sbin/haproxy -Ws -f $CONFIG -p $PIDFILE
ExecReload=/usr/sbin/haproxy -f $CONFIG -c -q
ExecReload=/bin/kill -USR2 $MAINPID
KillMode=mixed
Restart=always
SuccessExitStatus=143
TimeoutStopSec=30
Type=notify

# The following lines leverage SystemD's sandboxing options to provide
# defense in depth protection at the expense of restricting some 
flexibility

# in your setup (e.g. placement of your configuration files) or possibly
# reduced performance. See systemd.service(5) and systemd.exec(5) for 
further

# information.

# NoNewPrivileges=true
# ProtectHome=true
# If you want to use 'ProtectSystem=strict' you should whitelist the 
PIDFILE,

# any state files and any other files written using 'ReadWritePaths' or
# 'RuntimeDirectory'.
# ProtectSystem=true
# ProtectKernelTunables=true
# ProtectKernelModules=true
# ProtectControlGroups=true
# If your SystemD version supports them, you can add: @reboot, @swap, 
@sync

# SystemCallFilter=~@cpu-emulation @keyring @module @obsolete @raw-io

[Install]
WantedBy=multi-user.target


We've added the TimeoutStopSec=30 for some reason (I'd have to ask my 
college, something took longer or something like that, since we have 
quite a lot of frontends/listener/backend)
Only the two processes I mentioned before are / were running. Seems like 
the fallback didn't work properly?




Regards,

Olivier


--
Regards,
Christian Ruppert



Re: Weird issues with UNIX-Sockets on 2.1.x

2020-03-27 Thread Olivier Houchard
On Fri, Mar 27, 2020 at 04:21:20PM +0100, Christian Ruppert wrote:
> During the reload I just found something in the daemon log:
> Mar 27 13:37:54 somelb haproxy[20799]: [ALERT] 086/133748 (20799) : 
> Starting proxy someotherlistener: cannot bind socket [0.0.0.0:18540]
> Mar 27 13:37:54 somelb haproxy[20799]: [ALERT] 086/133748 (20799) : 
> Starting proxy someotherlistener: cannot bind socket [:::18540]
> 
> So during the reload, this happened and seems to have caused any further 
> issues/trouble.
> 

That would make sense. Does that mean you have old processes hanging
around ? Do you use seemless reload ? If so, it shouldn't attempt to
bind the socket, but get them from the old process.

Regards,

Olivier



Re: Weird issues with UNIX-Sockets on 2.1.x

2020-03-27 Thread Christian Ruppert

During the reload I just found something in the daemon log:
Mar 27 13:37:54 somelb haproxy[20799]: [ALERT] 086/133748 (20799) : 
Starting proxy someotherlistener: cannot bind socket [0.0.0.0:18540]
Mar 27 13:37:54 somelb haproxy[20799]: [ALERT] 086/133748 (20799) : 
Starting proxy someotherlistener: cannot bind socket [:::18540]


So during the reload, this happened and seems to have caused any further 
issues/trouble.


On 2020-03-27 15:10, Christian Ruppert wrote:

So now I looked for more of those "SC"'s in the log, from our
monitoring and it appeared first around 13:38:01.
Around 13:37:54 a reload was issued by puppet or rundeck.
So right now, it seems that something happened during the reload which
affected UNIX sockets.

On 2020-03-27 15:00, Christian Ruppert wrote:

Hi Olivier,

On 2020-03-27 14:50, Olivier Houchard wrote:

Hi Christian,

On Fri, Mar 27, 2020 at 02:37:41PM +0100, Christian Ruppert wrote:

Hi list,

we have some weird issues now, the second time, that *some* SSL 
sockets

seem to be broken as well as stats sockets.
HTTP seems to work fine, still, SSL ones are broken however. It 
happened
at least on 2.1.3 and *perhaps* on 2.1.2 as well. We're not sure 
whether

the first time was on 2.1.2 or 2.1.3.
The one that failed today was updated yesterday, so HAProxy has an
uptime of about 24h.
We're using threads. default + HTTP is using 1 thread, 1 is 
dedicated
for a TCP listener/Layer-4, one is for RSA only and all the rest is 
for

ECC.

[...]
The problem ocurred arount 13:40 (CET, in case it matters at some 
point)


Any ideas so far?



So basically, it used to work, and suddenly you get errors on any TLS
connection ?


Yeah, right now it looks like that way.

If you still have the TCP stat socket working, can you show the 
output

of "show fd" ?


Oh, it's the http stats listener that's still working. Not sure
whether it accepts any commands to be honest.
pid = 21313 (process #1, nbproc = 1, nbthread = 8)
uptime = 0d 1h56m48s
system limits: memmax = unlimited; ulimit-n = 1574819
maxsock = 1574819; maxconn = 786432; maxpipes = 0
current conns = 6; current pipes = 0/0; conn rate = 43/sec; bit rate =
219.704 kbps
Running tasks: 1/1158; idle = 100 %



Thanks !

Olivier


--
Regards,
Christian Ruppert



Re: Weird issues with UNIX-Sockets on 2.1.x

2020-03-27 Thread Christian Ruppert
So now I looked for more of those "SC"'s in the log, from our monitoring 
and it appeared first around 13:38:01.

Around 13:37:54 a reload was issued by puppet or rundeck.
So right now, it seems that something happened during the reload which 
affected UNIX sockets.


On 2020-03-27 15:00, Christian Ruppert wrote:

Hi Olivier,

On 2020-03-27 14:50, Olivier Houchard wrote:

Hi Christian,

On Fri, Mar 27, 2020 at 02:37:41PM +0100, Christian Ruppert wrote:

Hi list,

we have some weird issues now, the second time, that *some* SSL 
sockets

seem to be broken as well as stats sockets.
HTTP seems to work fine, still, SSL ones are broken however. It 
happened
at least on 2.1.3 and *perhaps* on 2.1.2 as well. We're not sure 
whether

the first time was on 2.1.2 or 2.1.3.
The one that failed today was updated yesterday, so HAProxy has an
uptime of about 24h.
We're using threads. default + HTTP is using 1 thread, 1 is dedicated
for a TCP listener/Layer-4, one is for RSA only and all the rest is 
for

ECC.

[...]
The problem ocurred arount 13:40 (CET, in case it matters at some 
point)


Any ideas so far?



So basically, it used to work, and suddenly you get errors on any TLS
connection ?


Yeah, right now it looks like that way.


If you still have the TCP stat socket working, can you show the output
of "show fd" ?


Oh, it's the http stats listener that's still working. Not sure
whether it accepts any commands to be honest.
pid = 21313 (process #1, nbproc = 1, nbthread = 8)
uptime = 0d 1h56m48s
system limits: memmax = unlimited; ulimit-n = 1574819
maxsock = 1574819; maxconn = 786432; maxpipes = 0
current conns = 6; current pipes = 0/0; conn rate = 43/sec; bit rate =
219.704 kbps
Running tasks: 1/1158; idle = 100 %



Thanks !

Olivier


--
Regards,
Christian Ruppert



Re: Weird issues with UNIX-Sockets on 2.1.x

2020-03-27 Thread Christian Ruppert

Hi Olivier,

On 2020-03-27 14:50, Olivier Houchard wrote:

Hi Christian,

On Fri, Mar 27, 2020 at 02:37:41PM +0100, Christian Ruppert wrote:

Hi list,

we have some weird issues now, the second time, that *some* SSL 
sockets

seem to be broken as well as stats sockets.
HTTP seems to work fine, still, SSL ones are broken however. It 
happened
at least on 2.1.3 and *perhaps* on 2.1.2 as well. We're not sure 
whether

the first time was on 2.1.2 or 2.1.3.
The one that failed today was updated yesterday, so HAProxy has an
uptime of about 24h.
We're using threads. default + HTTP is using 1 thread, 1 is dedicated
for a TCP listener/Layer-4, one is for RSA only and all the rest is 
for

ECC.

[...]
The problem ocurred arount 13:40 (CET, in case it matters at some 
point)


Any ideas so far?



So basically, it used to work, and suddenly you get errors on any TLS
connection ?


Yeah, right now it looks like that way.


If you still have the TCP stat socket working, can you show the output
of "show fd" ?


Oh, it's the http stats listener that's still working. Not sure whether 
it accepts any commands to be honest.

pid = 21313 (process #1, nbproc = 1, nbthread = 8)
uptime = 0d 1h56m48s
system limits: memmax = unlimited; ulimit-n = 1574819
maxsock = 1574819; maxconn = 786432; maxpipes = 0
current conns = 6; current pipes = 0/0; conn rate = 43/sec; bit rate = 
219.704 kbps

Running tasks: 1/1158; idle = 100 %



Thanks !

Olivier


--
Regards,
Christian Ruppert



Re: Weird issues with UNIX-Sockets on 2.1.x

2020-03-27 Thread Olivier Houchard
Hi Christian,

On Fri, Mar 27, 2020 at 02:37:41PM +0100, Christian Ruppert wrote:
> Hi list,
> 
> we have some weird issues now, the second time, that *some* SSL sockets 
> seem to be broken as well as stats sockets.
> HTTP seems to work fine, still, SSL ones are broken however. It happened 
> at least on 2.1.3 and *perhaps* on 2.1.2 as well. We're not sure whether 
> the first time was on 2.1.2 or 2.1.3.
> The one that failed today was updated yesterday, so HAProxy has an 
> uptime of about 24h.
> We're using threads. default + HTTP is using 1 thread, 1 is dedicated 
> for a TCP listener/Layer-4, one is for RSA only and all the rest is for 
> ECC.
[...]
> The problem ocurred arount 13:40 (CET, in case it matters at some point)
> 
> Any ideas so far?
> 

So basically, it used to work, and suddenly you get errors on any TLS
connection ?
If you still have the TCP stat socket working, can you show the output
of "show fd" ?

Thanks !

Olivier



Weird issues with UNIX-Sockets on 2.1.x

2020-03-27 Thread Christian Ruppert

Hi list,

we have some weird issues now, the second time, that *some* SSL sockets 
seem to be broken as well as stats sockets.
HTTP seems to work fine, still, SSL ones are broken however. It happened 
at least on 2.1.3 and *perhaps* on 2.1.2 as well. We're not sure whether 
the first time was on 2.1.2 or 2.1.3.
The one that failed today was updated yesterday, so HAProxy has an 
uptime of about 24h.
We're using threads. default + HTTP is using 1 thread, 1 is dedicated 
for a TCP listener/Layer-4, one is for RSA only and all the rest is for 
ECC.


A curl looks like:
# curl -kvs https://127.0.0.1:17920
* Rebuilt URL to: https://127.0.0.1:17920/
*   Trying 127.0.0.1...
* TCP_NODELAY set
* Connected to 127.0.0.1 (127.0.0.1) port 17920 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* Cipher selection: 
ALL:!EXPORT:!EXPORT40:!EXPORT56:!aNULL:!LOW:!RC4:@STRENGTH

* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* TLSv1.2 (OUT), TLS header, Certificate Status (22):
* TLSv1.2 (OUT), TLS handshake, Client hello (1):
* Unknown SSL protocol error in connection to 127.0.0.1:17920
* Curl_http_done: called premature == 1
* stopped the pause stream!
* Closing connection 0

Mar 27 14:26:25 somelb haproxy[21313]: 127.0.0.1:6762 
[27/Mar/2020:14:26:25.972] layer4stuff layer4stuff/socket-17920-ecc 
5/-1/4 0 SC 2/1/0/0/3 0/0


The config looks like this:
listen layer4stuff
mode tcp

bind :17920 process 1/2
bind :::17920 process 1/2

log global
option tcplog

# some early L4-ACLs

use-server socket-17920-rsa if !REQ_TLS_HAS_ECC
server socket-17920-rsa unix@/var/run/haproxy-17920-rsa.sock 
send-proxy-v2


use-server socket-17920-ecc if REQ_TLS_HAS_ECC
server socket-17920-ecc unix@/var/run/haproxy-17920-ecc.sock 
send-proxy-v2


listen tls-splitter
mode tcp
log global
option tcplog

bind unix@/var/run/haproxy-17920-rsa.sock accept-proxy user 
haproxy group root mode 600 ssl crt 
/etc/haproxy/haproxy.d/frontend/17920-RSA/ no-tlsv10 no-tlsv11 ciphers 
ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384:DHE-RSA-AES128-GCM-SHA256 
alpn h2,http/1.1 process 1/3
bind unix@/var/run/haproxy-17920-ecc.sock accept-proxy user 
haproxy group root mode 600 ssl crt 
/etc/haproxy/haproxy.d/frontend/17920-ECC/ no-tlsv10 no-tlsv11 ciphers 
ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384:DHE-RSA-AES128-GCM-SHA256 
alpn h2,http/1.1 process 1/4-


use-server socket-17910-h2 if { ssl_fc_alpn h2 }
server socket-17910-h2 unix@/var/run/haproxy-17910-h2.sock 
send-proxy-v2

use-server socket-17910 if !{ ssl_fc_alpn h2 }
server socket-17910 unix@/var/run/haproxy-17910.sock 
send-proxy-v2



frontend somefrontend
bind :17910 process 1/1
bind :::17910 process 1/1
bind unix@/var/run/haproxy-17910-h2.sock id 317910 accept-proxy 
user haproxy group root mode 600 proto h2 process 1/1 # H/2 TLS uplink
bind unix@/var/run/haproxy-17910.sock id 217910 accept-proxy 
user haproxy group root mode 600 process 1/1 # TLS uplink

acl is_https so_id 217910
acl is_https so_id 317910

mode http

# Some more stuff here

We've separated this broken loadbalancer for now, so if you want me to 
do anything with the running process, let me know.
The above is basically the same for every other site. Our config is 
quite huge.
Since the stats socket is broken as well, it may be something related to 
the UNIX sockets.

The stats TCP socket is still working.

# echo help | socat - /var/run/haproxy.stat
2020/03/27 14:23:54 socat[21715] E connect(5, AF=1 
"/var/run/haproxy.stat", 23): Connection refused


# ps aux|grep haproxy
root 20799  0.0  0.0  49624  7608 ?Ss   Mar26   1:10 
/usr/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid 
-sf 21313
haproxy  21313  5.4  0.7 887168 236788 ?   Sl   13:00   4:59 
/usr/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid 
-sf 15635


The problem ocurred arount 13:40 (CET, in case it matters at some point)

Any ideas so far?

I'm currently trying to find something useful in the log around that 
time, whether it was a reload or something.


Debian 9.12 currently
# haproxy -vv
HA-Proxy version 2.1.3 2020/02/12 - https://haproxy.org/
Status: stable branch - will stop receiving fixes around Q1 2021.
Known bugs: http://www.haproxy.org/bugs/bugs-2.1.3.html
Build options :
  TARGET  = linux-glibc
  CPU = generic
  CC  = gcc
  CFLAGS  = -O2 -g -fno-strict-aliasing -Wdeclaration-after-statement 
-fwrapv -Wno-unused-label -Wno-sign-compare -Wno-unused-parameter 
-Wno-old-style-declaration -Wno-ignored-qualifiers -Wno-clobbered