Hi Jeff,
On Sun, Aug 14, 2011 at 04:01:53PM -0400, Jeff Buchbinder wrote:
> I've been working on an "API" patch, where certain functionality is
> exposed over the stats HTTP service. The "fork" where I have been
> working on this is available here:
>
> https://github.com/jbuchbinder/haproxy
>
> The full patch (which I imported from my old working tree) is here, if
> you want to see just the changes:
>
> https://github.com/jbuchbinder/haproxy/commit/0f924468977fc71f2530837e3e44cf47fc00fd0f
>
> Documentation is available here:
>
> https://github.com/jbuchbinder/haproxy/blob/master/README.API
>
> It was recently suggested that I attempt to get this patch included
> upstream.
Well, you apparently did a nice amount of work. I'm not opposed to an API
(after all, we've developed one at Exceliance too), but we need to respect
a certain amount of basic component behaviour rules. An API will never do
more than what the component itself is able to do, it's just a convenient
(or sometimes at least a different) way of making it do something it is
able to do.
If you look at how other load balancers work (and most network equipment
too, BTW), you generally have multiple interaction levels between the user
and the internal state :
- monitoring : the user wants to check the current state of the product ;
- stats : the user wants to check some stats that were aggregated over
a period, sometimes since last clear or last reboot. Those generally
are counters ;
- operational status : the user wants to temporarily change something
for the current running session, because this can help him make some
other operations more transparent or better resist an unexpected
condition (eg: imbalanced servers after a failure).
- configuration status : the user wants a change to be performed and
kept until a new configuration change undoes it.
For quite some time, monitoring and stats have been available under various
forms (http or unix socket). Recently, a few operational status changes were
brought first on the unix socket and later on the web page. Those are still
limited (enable/disable of a server, clear a table entry, change a server's
weight) and even more limited for the web access. Still, that starts to fit
some usage patterns.
All of these accesses apply to the currently running session. This
means that if you restart the process, everything is lost, much like what
happens if you reboot your router after temporarily killing a BGP session
or shutting a port. And this is expected : you want those changes to be
temporary because they're made as part of some other operation.
The configuration status has a very different usage pattern : the change
that is performed must absolutely meet two important requirements :
- the changes that are performed must be kept across a restart ;
- what is performed must have the same effect after the restart that
it had during the hot change.
The first one is crucial : if your process dies and is restarted by a
monitoring tool without the user knowing it, all changes are lost and
nobody knows. Also, the process would restart with an old invalid config
which does not match what was running before the restart, until someone
pushes the changes again (provided someone is able to determine the diff
between what's in the file at the moment of restart and what was running
before it). Worse, some orthogonal changes may be performed live and in
the config file, making the combination of the two incompatible. For
instance, you would add a server via the live API while, in parallel,
someone adds a cookie to every server in the config file. If after a
restart you re-apply the same changes, you'll get a wrong config in which
the last added server is the only one without a cookie.
=> This means that you need your file-based config to always be in
sync with the API changes. There is no reliable way of doing so
in any component if the changes are applied at two distinct
places at the same time !
The second point is important too : even if we assume that you find a way
to more-or-less ensure that your config file gets the equivalent changes
and is up to date, you must absolutely ensure that what is there will work
upon a restart and will exhibit the exact same behaviour.
There are a large number of issues that can arise from performing changes
in a different order than what is done at once upon start-up. Most people
who had to deal with Alteon LBs for instance, know that sometimes something
does not behave as expected after a change and a reboot fixes the issue
(eg: renumbering a large number of filters, or changing health checking
methods). And there's nothing really wrong with that, it's just that the
problem by itself is complex. On unix-like systems, many of us have already
been hit by an issue involving two services bound to the same port, one on
the real IP and the other one bound to 0.0.0.0. If you bind 0.0.0.0 first,
on most systems both may bind, while the reverse is not allowed. Back to
our API case, imagine you have two frontends declared this way :
    frontend f1
        bind 1.2.3.4:8888

    frontend f2
        bind 3.4.5.6:8080
You first change the IP of the second one to become 0.0.0.0:8080. It's fine,
the change is accepted. Now you change the first one's port to become 8080.
It's fine too : its address is more precise than 0.0.0.0, so the change is
accepted. You perform the same changes in parallel on your file-based config :
    frontend f1
        bind 1.2.3.4:8080

    frontend f2
        bind 0.0.0.0:8080
One Saturday evening you get a power hiccup and your server reboots. Haproxy
will complain that frontend f2 cannot bind : "Address already in use".
There is only one way to solve this class of issues : respecting these
two rules :
- the changes must be performed to one single place, which is the reference
(here the config file)
- the changes must then be applied using the normal process from this
reference
What this means is that anything related to changing more than an operational
status must be performed on the config file first, then propagated to the
running processes using the same method that is used upon start up (config
parsing and loading).
Right now haproxy is not able to reload a config once it's started. And since
we chroot it, it will not be able to access the FS afterwards. However we can
start a new process on the new config (that's what most of us are currently
doing).
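For the record, that sequence usually looks like this (paths are examples;
-c only checks the config, -sf tells the old process to finish its current
sessions and then exit) :

```
haproxy -c -f /etc/haproxy/haproxy.cfg   # validate the new reference first
haproxy -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid \
        -sf $(cat /var/run/haproxy.pid)  # new process takes over
```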
But I predict that we'll one day reach a point where the restart process
involves calling the haproxy program to parse and compile the config file,
then feeding the compiled result to either the already running process or
a new one forked for that purpose.
That's why I'm very interested in Simon's work concerning the master-worker
model. After a first positive impression we found a number of structural
issues that make it very hard to implement quickly in haproxy as it is now,
but the time we spend trying to understand and solve those issues
strengthens our belief that it clearly is the way to go for a more dynamic
process with hot config changes and the like.
With all that, I'd say that I'm very open to have an API to check/change
monitoring, stats and operational state, but I will not accept an API to
change the running config without passing via the config file which is
the only reference we can trust. I would possibly accept extending the
operational changes to a wider area than what seems reasonable, in order
to help getting out of silly situations, but I doubt that there would be
any real justification for doing such specific things via an API (eg:
changing a running server's IP address, or killing a frontend to release
the port without stopping the rest).
Last, there are a number of remaining issues that are not overcome by the
current patch, making it risky for you to use it in production :
- a heavily loaded proxy may lack memory (I'm used to seeing that quite
often during benchmarks), and a number of malloc() or strdup() are
not checked. This means that calling the API for a change under
harsh conditions might cause a process crash. It's not always easy
to perform atomic changes because you may need to undo many things
when the failure happens late in the change (especially requests that
involve changing something which already exists). On the other hand, I
noticed a significant number of strdup() calls that were not needed and
could simply be removed rather than fixed.
- resources are allocated upon startup. If you want to add a server,
you need to reserve a file descriptor for its checks. Not doing so
means that by adding many servers, you'll noticeably reduce the
number of available FDs, causing connection resets for some clients
when the limit is reached. Similarly, adding listening sockets to
frontends consumes permanent FDs that are deducted from the total
number of available ones. If you have the default limit of 1024 FDs
and add a binding covering a 1024-port range, you'll eat all possible
sockets and you won't even be able to reconnect to undo your change.
Adding frontends, loggers and servers requires extra FDs, so the FDs
reserved for that purpose should be configurable and the limit
enforced by your API (eg: "too many servers added").
- since you make use of some malloc(), you need to be extremely careful
about the risk of memory leak, especially in error paths. An API usually
gets pushed to automated components, and you wouldn't like your LBs to
eat all their memory in 2 days and need regular reboots. That's why
malloc() is never used outside the pools. The pools are less convenient
to use but much more robust, and specifically the care they require from
the developer ensures that those issues can almost never happen.
Another important point is that an API must be discussed with all its
adopters. At Exceliance, we discussed ours with interested customers to
take their ideas into account. It's not because they were customers but
because they were future adopters. It's very possible that the way you
designed it perfectly fits your purpose but will be unusable to many other
people for a variety of reasons. Designing a usable and evolvable API may
take months of discussions but it's probably worth it.
Best regards,
Willy