Hi James,

On Thu, May 08, 2014 at 08:58:59PM +0100, James Hogarth wrote:
> On 2 May 2014 20:10, Willy Tarreau <w...@1wt.eu> wrote:
> 
> > You're welcome. I really want to release 1.5-final ASAP, but at least
> > with everything in place so that we can safely fix the minor remaining
> > annoyances. So if we identify quickly that things are still done wrong
> > and need to be addressed before the release (eg: because we'll be forced
> > to change the way some config settings are used), better do it ASAP.
> > Otherwise if we're sure that a given config behaviour will not change,
> > such fixes can happen in -stable because they won't affect users who
> > do not rely on them.
> >
> >
> Alright, in light of the above, here's an RFC patch that's still a little
> WIP... we've yet to write the documentation on the shm_balance mode, but
> we are running this in a production environment.

I must confess I don't really understand what behaviour this shm_balance
mode is supposed to provide. I'm seeing that the actconn variable seems to
be shared between all processes and is incremented and decremented without
any form of locking, so I'm a bit scared when you say that it's running in
production!
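
Just to illustrate the locking concern: a counter living in shared memory
and touched by several forked processes needs at least atomic updates,
otherwise concurrent increments/decrements can silently be lost. Here's a
rough sketch of that minimum using the gcc __sync builtins; nothing below
is taken from your patch and all the names are made up:

  #include <stdint.h>
  #include <sys/mman.h>

  struct shm_state {
          uint32_t actconn;           /* connections across all processes */
  };

  static struct shm_state *shm;

  /* map one anonymous shared page before fork() so children inherit it */
  static int shm_init(void)
  {
          shm = mmap(NULL, sizeof(*shm), PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
          return (shm == MAP_FAILED) ? -1 : 0;
  }

  static inline void actconn_inc(void)
  {
          __sync_fetch_and_add(&shm->actconn, 1);     /* atomic ++ */
  }

  static inline void actconn_dec(void)
  {
          __sync_fetch_and_sub(&shm->actconn, 1);     /* atomic -- */
  }

Of course the __sync builtins only exist on reasonably recent gcc versions,
but that's the idea: without something like this, two processes updating
actconn at the same time can corrupt the count.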

More comments below.

> Our environment is dev21 at present but I just rebased it to the tarball
> snapshot of last night...
> 
> It compiles against that, but please note I've not yet tested it against
> that!
> 
> To give you an idea on how to use it here's a sanitised snippet of config:
> 
> global
>   nbproc 4
>   daemon
>   maxconn 4000
>   stats timeout 1d
>   log 127.0.0.1 local2
>   pidfile /var/run/haproxy.pid
>   stats socket /var/run/haproxy.1.sock level admin
>   stats socket /var/run/haproxy.2.sock level admin
>   stats socket /var/run/haproxy.3.sock level admin
>   stats socket /var/run/haproxy.4.sock level admin
>   stats bind-process all
>   shm-balance my_shm_balancer
> listen web-stats-1
>   bind 0.0.0.0:81
>   bind-process 1
>   mode http
>   log global
>   maxconn 10
>   clitimeout 10s
>   srvtimeout 10s
>   contimeout 10s
>   timeout queue 10s
>   stats enable
>   stats refresh 30s
>   stats show-node
>   stats show-legends
>   stats auth admin:password
>   stats uri /haproxy?stats
> listen web-stats-2
>   bind 0.0.0.0:82
>   bind-process 2
>   mode http
>   log global
>   maxconn 10
>   clitimeout 10s
>   srvtimeout 10s
>   contimeout 10s
>   timeout queue 10s
>   stats enable
>   stats refresh 30s
>   stats show-node
>   stats show-legends
>   stats auth admin:password
>   stats uri /haproxy?stats
> listen web-stats-3
>   bind 0.0.0.0:83
>   bind-process 3
>   mode http
>   log global
>   maxconn 10
>   clitimeout 10s
>   srvtimeout 10s
>   contimeout 10s
>   timeout queue 10s
>   stats enable
>   stats refresh 30s
>   stats show-node
>   stats show-legends
>   stats auth admin:password
>   stats uri /haproxy?stats
> listen web-stats-4
>   bind 0.0.0.0:84
>   bind-process 4
>   mode http
>   log global
>   maxconn 10
>   clitimeout 10s
>   srvtimeout 10s
>   contimeout 10s
>   timeout queue 10s
>   stats enable
>   stats refresh 30s
>   stats show-node
>   stats show-legends
>   stats auth admin:password
>   stats uri /haproxy?stats
> listen frontendname
>   bind 0.0.0.0:52000
>   server server 10.0.0.1:27000 id 1 check port 9501
>   option httpchk GET /status HTTP/1.0
>   mode tcp
> 
> The haproxy-shm-client can be used to query the shm to see how things are
> loaded, and to weight/disable/enable threads for processing queries.

But why not use the stats socket instead of using a second access path to
check the status?

> Now why did we do this?
> 
> When we were testing multiple processes, one thing we noted was that which
> process was most likely to accept() was actually a bit unintuitive. Rather
> than being busy causing a 'natural' load balancing behaviour, it worked out
> against this.

Yes, that's the reason why tune.maxaccept is divided by the number of
active processes. With recent kernels (3.9+), the system will automatically
round-robin between multiple socket queues bound to the same ip:port.
However this requires multiple sockets. With the latest changes allowing
the bind-process to go down to the listener (at last!), I realized that in
addition to allowing it for the stats socket (primary goal), it provides an
easy way to benefit from the kernel's round robin without having to
implement the bind/unbind sequence I was planning.
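
For reference, the mechanism I'm talking about is the SO_REUSEPORT support
added in Linux 3.9: every process opens its own listening socket on the
same ip:port and the kernel spreads incoming connections between them. A
minimal illustration of what each process does (this is not haproxy's
actual listener code, just the principle):

  #include <netinet/in.h>
  #include <string.h>
  #include <sys/socket.h>
  #include <unistd.h>

  #ifndef SO_REUSEPORT
  #define SO_REUSEPORT 15     /* usual Linux value, missing in old headers */
  #endif

  static int listen_reuseport(int port)
  {
          int one = 1;
          struct sockaddr_in sa;
          int fd = socket(AF_INET, SOCK_STREAM, 0);

          if (fd < 0)
                  return -1;

          setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
          setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

          memset(&sa, 0, sizeof(sa));
          sa.sin_family      = AF_INET;
          sa.sin_addr.s_addr = htonl(INADDR_ANY);
          sa.sin_port        = htons(port);

          if (bind(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0 ||
              listen(fd, 1024) < 0) {
                  close(fd);
                  return -1;
          }
          return fd;     /* each process listens on its own socket */
  }

With every process binding its own socket like this, the kernel does the
inter-process distribution itself, which is exactly what the per-listener
process binding lets the config express.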

> If a thread was currently on the CPU, it was reasonably likely that it
> would be the first to grab the connection, due to the need for the 'idle'
> ones to context switch onto the CPU. As a result it was primarily only one
> or two haproxy processes actually picking up the connections, and it made
> for very asymmetrical balancing across processes.

You see this pattern even more often when running local benchmarks. Just
run two processes on a dual-core, dual-thread system, then have the load
generator on the same system, and you'll see that the load generator
disturbs one of the processes more than the other.

> The algorithm looks to see if it is in the least busy 'half bucket' and if
> so will accept the connection... otherwise it will ignore it.

But then it will make things worse, because that means that your process
that was woken up by poll() to accept a connection and which doesn't want
it will suddenly enter a busy loop until another process accepts this
connection. I find this counter-productive. The principle with the current
maxaccept is that we reduce the number of connections a process is allowed
to accept at once, so that when it goes back to process them, by the time
it initializes everything in them, the other processes get a chance to grab
other connections. And that way we don't enter a busy loop doing nothing.
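
To make the difference concrete, here's a rough sketch of an accept loop
with level-triggered polling (names and structure made up, this is not the
actual haproxy event loop):

  #include <poll.h>
  #include <sys/socket.h>

  void accept_loop(int listen_fd, int max_accept)
  {
          struct pollfd pfd = { .fd = listen_fd, .events = POLLIN };
          int i;

          while (1) {
                  if (poll(&pfd, 1, -1) <= 0)
                          continue;

                  /* shm_balance idea: return here without accepting when the
                   * process considers itself too busy. The fd stays readable,
                   * so the next poll() wakes us up again immediately: we spin
                   * until another process finally accepts the connection.
                   */

                  /* maxaccept idea: accept at most a few connections, then go
                   * back to processing them, leaving the rest of the backlog
                   * for the sibling processes to pick up.
                   */
                  for (i = 0; i < max_accept; i++) {
                          int cfd = accept(listen_fd, NULL, NULL);
                          if (cfd < 0)
                                  break;      /* backlog drained */
                          /* ... hand cfd off to the connection handler ... */
                  }
          }
  }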

> There is
> 'deadpeer' tracking included so if a process stops responding for some
> reason the others should notice and not include it in the counts. In our
> internal testing this has presented much more balanced behaviour over the
> haproxy instances.
> 
> Attached to this are three files... two to apply to dev21 (like we are
> currently running) and one that is the rebase I just carried out.
> 
> I'll be writing up proper documentation over the next few days and if you
> have any queries feel free to drop me a line.
> 
> As a side note, this code does use __builtin_popcount(), which is a gcc-ism,
> but given the makefile generally refers to gcc it seemed safe to do. If you
> have a recent CPU with the popcnt instruction (check your CPU flags) you can
> compile this with -mpopcnt for a small latency gain in the codepath used
> when nbproc > 1 with shm enabled.

In fact haproxy only supports gcc, but it supports old versions. The code
is known to still build with 2.95. I don't remember if __builtin_popcount()
was present back then.
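
If it turns out the builtin is missing on the oldest supported gcc, a tiny
portable fallback would do the job anyway; something along these lines,
where HAVE_BUILTIN_POPCOUNT is just a made-up switch standing for whatever
version check turns out to be needed:

  #ifdef HAVE_BUILTIN_POPCOUNT
  #define my_popcount(x) __builtin_popcount(x)
  #else
  static inline unsigned int my_popcount(unsigned int x)
  {
          unsigned int count = 0;

          while (x) {
                  x &= x - 1;     /* clears the lowest set bit */
                  count++;
          }
          return count;
  }
  #endif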

However, to be honest, I find all this immensely complicated and fragile
compared to just relying on the system to perform the proper load balancing
at the socket level. With the latest git code and the per-listener process
binding, your config above can go down to something as simple as:

 global
   nbproc 4
   daemon
   maxconn 4000
   stats timeout 1d
   log 127.0.0.1 local2
   pidfile /var/run/haproxy.pid
   stats socket /var/run/haproxy.1.sock process 1 level admin
   stats socket /var/run/haproxy.2.sock process 2 level admin
   stats socket /var/run/haproxy.3.sock process 3 level admin
   stats socket /var/run/haproxy.4.sock process 4 level admin

 listen web-stats
   bind 0.0.0.0:81 process 1
   bind 0.0.0.0:82 process 2
   bind 0.0.0.0:83 process 3
   bind 0.0.0.0:84 process 4
   mode http
   log global
   maxconn 10
   clitimeout 10s
   srvtimeout 10s
   contimeout 10s
   timeout queue 10s
   stats enable
   stats refresh 30s
   stats show-node
   stats show-legends
   stats auth admin:password
   stats uri /haproxy?stats

 listen frontendname
   bind 0.0.0.0:52000
   server server 10.0.0.1:27000 id 1 check port 9501
   option httpchk GET /status HTTP/1.0
   mode tcp

And if you want "listen frontendname" to benefit from the kernel's
round robin, simply do this instead:

 listen frontendname
   bind 0.0.0.0:52000 process 1
   bind 0.0.0.0:52000 process 2
   bind 0.0.0.0:52000 process 3
   bind 0.0.0.0:52000 process 4
   server server 10.0.0.1:27000 id 1 check port 9501
   option httpchk GET /status HTTP/1.0
   mode tcp

I know that some people will not run on an OS which offers that feature,
but if they absolutely require it, I'd rather see them upgrade their kernel
than have to change the way haproxy accepts connections for everyone :-/

Note that I'm not dismissing your work; at least it works for you, and when
you did it, the per-socket binding was still broken. The first attempt at
getting it right dates back to Jan 2013, and it's only after 4 failed
attempts since the beginning of this month that I got it to work, so for
sure I could not have asked you to wait indefinitely! But if the goal is
"only" to improve inter-process load balancing, then I think it's overkill
compared to what we have now.

In this thread a few mails ago, you were speaking about improving SSL load
balancing; is that about this patch, or do you have other patches that
you'd like to submit?

Best regards,
Willy

