Hi Willy,

> > > A few more things on the core dumps :
> > >  - they are ignored if you have a chroot statement in the global section
> > >  - you must not use "user/uid/group/gid", otherwise the system also
> > >    disables core dumps
> > 
> > I'm using chroot and user/group in my config, so I'm not able to share core 
> > dumps.
>
> Well, if you can at least attach gdb to a process, hoping to see it stop, and
> emit "bt full" to see the whole backtrace, it will help a lot.

I will try to get a backtrace, but this can take a while. I'm running with 7 
processes which respawn every few minutes, and the worker crashes only happen 
every few hours at random moments. So I need some luck and good timing here...
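
For what it's worth, this is roughly what I plan to do once I manage to catch a 
worker in time (the PID is just an example):

    gdb -p 12345        # attach to one of the running haproxy workers
    (gdb) continue      # let it run until it hits the fault
    (gdb) bt full       # dump the full backtrace once gdb stops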

> > > There are very few abort() calls in the code :
> > >  - some in the thread debugging code to detect recursive locks ;
> > >  - one in the cache applet which triggers on an impossible case very
> > >    likely resulting from cache corruption (hence a bug)
> > >  - a few inside the Lua library
> > >  - a few in the HPACK decompressor, detecting a few possible bugs there
> > >
> > > Except for Lua, all of them were added during 1.8, so depending on what 
> > > the
> > > configuration uses, there are very few possible candidates.
> > 
> > I included my configuration in this mail. Hopefully this will narrow down the
> > possible candidates.
>
> Well, at least you don't use threads, Lua, caching or HTTP/2, so
> it cannot come from any of those we have identified. It could still come
> from openssl, however.

There are some bugfixes marked as medium in the haproxy 1.8 repository related 
to SSL. Could they be related to the crashes I'm seeing?
 
> > I did some more research into the memory warnings we encounter every few days.
> > It seems like the haproxy processes use a lot of memory. Would haproxy with
> > nbthread share this memory?
>
> It depends. In fact, the memory will indeed be shared between threads
> started together, but if this memory is consumed at load time and never
> modified, it's also shared between the processes already.
>
> > I'm using systemd to reload haproxy for new SSL certificates every few 
> > minutes.
>
> OK. I'm seeing that you load certs from a directory in your config. Do you
> have a high number of certs ? I'm asking because we've already seen some
> configs eating multiple gigs of RAM with the certs because there were a lot.

Yes, I have around 40k SSL certificates in this directory, and that number is 
growing over time.

> In your case they're loaded twice (one for the IPv4 bind line, one for the
> IPv6). William planned to work on a way to merge all identical certs and have
> a single instance of them when loaded multiple times, which should already
> reduce the amount of memory consumed by this.

I can bind ipv4 and ipv6 in the same line with:
bind ipv4@:443,ipv6@:443 ssl crt /etc/haproxy/ssl/invalid.pem crt 
/etc/haproxy/ssl/ crt /etc/haproxy/customer-ssl/ strict-sni backlog 65534

This would also solve the "double load" issue, right?

> > frontend fe_http
> >     bind ipv4@:80 backlog 65534
> >     bind ipv6@:80 backlog 65534
> >     bind ipv4@:443 ssl crt /etc/haproxy/ssl/invalid.pem crt 
> >/etc/haproxy/ssl/ crt /etc/haproxy/customer-ssl/ strict-sni backlog 65534
> >     bind ipv6@:443 ssl crt /etc/haproxy/ssl/invalid.pem crt 
> >/etc/haproxy/ssl/ crt /etc/haproxy/customer-ssl/ strict-sni backlog 65534
> >     bind-process 1-7
> >     tcp-request inspect-delay 5s
> >     tcp-request content accept if { req_ssl_hello_type 1 }
>
> This one is particularly strange. I suspect it's a leftover from an old
> configuration dating from the days where haproxy didn't support SSL,
> because it's looking for SSL messages inside the HTTP traffic, which
> will never be present. You can safely remove those two lines.

We use this to guard against some attacks we have seen in the past, where 
clients set up connections without an SSL handshake to exhaust all available 
connections. I will remove these two lines if you are sure this check no 
longer works.

> >     option forwardfor
> >     acl secure dst_port 443
> >     acl is_acme_request path_beg /.well-known/acme-challenge/
> >     reqadd X-Forwarded-Proto:\ https if secure
> >     default_backend be_reservedpage
> >     use_backend be_acme if is_acme_request
> >     use_backend 
> >%[req.fhdr(host),lower,map_dom(/etc/haproxy/domain2backend.map)]
> >     compression algo gzip
> >     maxconn 32000
>
> In my opinion it's not a good idea to let a single frontend steal all the
> process's connections, it will prevent you from connecting to the stats
> page when a problem happens. You should have a slightly larger global
> maxconn setting to avoid this.

Yes, you are right. I will fix this in my configuration.
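
If I understand correctly, something like this should keep the stats page 
reachable (the numbers are just placeholders around my current frontend 
maxconn of 32000):

    global
        maxconn 33000    # slightly above the frontend maxconn

    frontend fe_http
        maxconn 32000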

> > backend be_acme
> >     bind-process 1
> >     option httpchk HEAD /ping.php HTTP/1.1\r\nHost:\ **removed hostname**
> >     option http-server-close
> >     option http-pretend-keepalive
>
> Same comment here as above regarding close vs keep-alive.

Yes, I will take a look at this as well.
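
If the suggestion is to drop the forced close and rely on keep-alive, I assume 
the change would look roughly like this (untested on my side):

    backend be_acme
        bind-process 1
        option httpchk HEAD /ping.php HTTP/1.1\r\nHost:\ **removed hostname**
        option http-keep-alive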

> Aside from this, I really see nothing suspicious in your configuration that could
> justify a problem. Let's hope you can at least either catch a core or attach
> a gdb to one of these processes.

I will let you know as soon as I'm able to get a backtrace. In the meantime, I 
will improve and test my new configuration changes.

Thanks,
Frank
