Hi Jeff,

On Sun, Aug 14, 2011 at 04:01:53PM -0400, Jeff Buchbinder wrote:
> I've been working on an "API" patch, where certain functionality is
> exposed over the stats HTTP service. The "fork" where I have been
> working on this is available here:
> 
> https://github.com/jbuchbinder/haproxy
> 
> The full patch (which I imported from my old working tree) is here, if
> you want to see just the changes:
> 
> https://github.com/jbuchbinder/haproxy/commit/0f924468977fc71f2530837e3e44cf47fc00fd0f
> 
> Documentation is available here:
> 
> https://github.com/jbuchbinder/haproxy/blob/master/README.API
> 
> It was recently suggested that I attempt to get this patch included
> upstream.

Well, you apparently did a nice amount of work. I'm not opposed to an API
(after all, we've developed one at Exceliance too), but we need to respect
a certain number of basic rules about component behaviour. An API will never
do more than what the component itself is able to do, it's just a convenient
(or sometimes at least a different) way of making it do something it is
already able to do.

If you look at how other load balancers work (and most network equipment
too, BTW), you generally have multiple interaction levels between the user
and the internal state :

  - monitoring : the user wants to check the current state of the product ;
  - stats : the user wants to check some stats that were aggregated over
    a period, sometimes since last clear or last reboot. Those generally
    are counters ;
  - operational status : the user wants to temporarily change something
    for the current running session, because this can help him make some
    other operations more transparent or better resist an unexpected
    condition (eg: imbalanced servers after a failure) ;
  - configuration status : the user wants a change to be performed and
    kept until a new configuration change undoes it.

For quite some time, monitoring and stats have been available in various
forms (HTTP or unix socket). Recently, a few operational status changes were
brought first to the unix socket and later to the web page. Those are still
limited (enabling/disabling a server, clearing a table entry, changing a
server's weight) and even more limited for web access. Still, that starts
to fit some usages.
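
For illustration, here is roughly what these accesses look like today over
the unix socket (the socket path and the backend/server names below are
only examples) :

   # monitoring/stats : dump the current state and counters
   echo "show stat" | socat stdio /var/run/haproxy.stat

   # operational changes : temporary, lost upon restart
   echo "disable server bk_app/srv1" | socat stdio /var/run/haproxy.stat
   echo "set weight bk_app/srv1 50" | socat stdio /var/run/haproxy.stat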

All of these accesses apply to the currently running session. This means
that if you restart the process, everything is lost, much like what happens
if you reboot your router while you had temporarily killed a BGP session or
shut a port. And this is expected : you want those changes to be temporary
because they're made as part of some other operation.

The configuration status has a very different usage pattern : the change
that is performed must absolutely meet two important requirements :
  - the changes that are performed must be kept across a restart ;
  - what is performed must have the same effect after the restart that
    it had during the hot change.

The first one is crucial : if your process dies and is restarted by a
monitoring tool without the user knowing it, all changes are lost and
nobody knows. Also, the process would restart with an old, invalid config
which does not match what was running before the restart, until someone
pushes the changes again (provided someone is able to determine the diff
between what's in the file at the moment of the restart and what was
running before it). Worse, some orthogonal changes may be performed both
live and in the config file, making the combination of the two incompatible.
For instance, you would add a server via the live API while, in parallel,
someone adds a cookie to every server in the config file. If after a restart
you re-apply the same changes, you'll get a wrong config in which the last
added server does not have any cookie.
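
To make that concrete, here is a sketch of the resulting divergence (all
names are made up) :

   # config file after the parallel edit : every server got a cookie
   backend bk_app
        cookie SRV insert indirect
        server s1 10.0.0.1:80 cookie s1 check
        server s2 10.0.0.2:80 cookie s2 check
        # server re-added by replaying the API change after a restart :
        # no cookie, so it silently breaks stickiness
        server s3 10.0.0.3:80 check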

  => This means that you need your file-based config to always be in
     sync with the API changes. There is no reliable way of doing so
     in any component if the changes are applied at two distinct
     places at the same time !

The second point is important too : even if we assume that you find a way
to more-or-less ensure that your config file gets the equivalent changes
and is up to date, you must absolutely ensure that what is there will work
upon a restart and will exhibit the exact same behaviour.

There are a large number of issues that can arise from performing changes
in a different order than what is done at once upon start-up. Most people
who had to deal with Alteon LBs, for instance, know that sometimes something
does not behave as expected after a change and a reboot fixes the issue
(eg: renumbering a large number of filters, or changing health checking
methods). And there's nothing really wrong with that : it's just that the
problem itself is complex. On unix-like systems, many of us have already
been hit by an issue involving two services bound to the same port, one on
the real IP and the other one bound to 0.0.0.0. If you bind 0.0.0.0 first,
on most systems both may bind, while the reverse is not allowed. Back to
our API case, imagine you have two frontends declared this way :

   frontend f1
        bind 1.2.3.4:8888

   frontend f2
        bind 3.4.5.6:8080

You first change the address of the second one to become 0.0.0.0:8080. It's
fine, the change is accepted. Now you change the first one's port to become
8080. It's fine too : its address 1.2.3.4 is more precise than 0.0.0.0, so
the change is accepted. You perform the same changes in parallel on your
file-based config :

   frontend f1
        bind 1.2.3.4:8080

   frontend f2
        bind 0.0.0.0:8080

One Saturday evening you get a power hiccup and your server reboots. This
time haproxy binds 1.2.3.4:8080 first, so it will complain that frontend f2
cannot bind : "Address already in use".

There is only one way to solve this class of issues, which is to respect
these two rules :
  - the changes must be performed to one single place, which is the reference
    (here the config file)
  - the changes must then be applied using the normal process from this
    reference

What this means is that anything related to changing more than an operational
status must be performed on the config file first, then propagated to the
running processes using the same method that is used upon start up (config
parsing and loading).

Right now haproxy is not able to reload a config once it's started. And
since we chroot it, it cannot access the filesystem afterwards anyway.
However we can start a new process on the new config and let it take over
from the old one (that's what most of us are currently doing).
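
For reference, that usually looks like this (paths are only examples) :

   # validate the edited config first, then start a new process which
   # asks the old one (-sf) to finish its sessions and exit
   haproxy -c -f /etc/haproxy/haproxy.cfg && \
   haproxy -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid \
           -sf $(cat /var/run/haproxy.pid)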

But I predict that we'll one day reach a point where the restart process
involves calling the haproxy program to parse and compile the config file,
then feeding the compiled result either to the already running process or
to a new one just being forked for that purpose.

That's why I'm very interested in Simon's work concerning the master-worker
model. After a first positive impression we found a number of structural
issues that make it very hard to implement quickly in haproxy as it is now,
but the time we spend trying to understand and solve those issues reinforces
our conviction that it clearly is the way to go for a more dynamic process
with hot config changes and the like.

With all that said, I'm very open to an API to check/change monitoring,
stats and operational state, but I will not accept an API that changes the
running config without going through the config file, which is the only
reference we can trust. I would possibly accept extending the operational
changes to a wider area than what currently seems reasonable, in order to
help get out of silly situations, but I doubt that there would be any real
justification for doing such specific things via an API (eg: changing a
running server's IP address, or killing a frontend to release its port
without stopping the rest).

Last, there are a number of remaining issues that are not overcome by the
current patch, making it risky for you to use it in production :
  - a heavily loaded proxy may lack memory (I'm used to seeing that quite
    often during benchmarks), and a number of malloc() or strdup() calls
    are not checked. This means that calling the API for a change under
    harsh conditions might crash the process. It's not always easy to
    perform atomic changes because you may need to undo many things when
    the failure happens late in the change (especially for requests that
    modify something which already exists) ; see the first sketch after
    this list. Also, I noticed a significant number of strdup() calls that
    were not needed and could simply be removed instead of being fixed.

  - resources are allocated upon startup. If you want to add a server,
    you need to reserve a file descriptor for its checks. Not doing so
    means that by adding many servers, you'll noticeably reduce the
    number of available FDs, causing connection resets for some clients
    when the limit is reached. Similarly, adding listening sockets to
    frontends consumes permanent FDs which are deducted from the total
    amount of available ones. If you have the default limit of 1024 FDs
    and add a binding to a 1024-port range, you'll eat all possible
    sockets and won't even be able to reconnect to disable your change.
    Adding frontends, loggers and servers requires adding FDs, so the
    extra FDs reserved for that purpose should be configurable and the
    limit enforced by your API (eg: "too many servers added").

  - since you make use of malloc(), you need to be extremely careful
    about the risk of memory leaks, especially in error paths. An API
    usually ends up being driven by automated components, and you wouldn't
    like your LBs to eat all their memory in 2 days and need regular
    reboots. That's why malloc() is never used outside the pools. The
    pools are less convenient to use but much more robust, and the care
    they require from the developer specifically ensures that such issues
    can almost never happen ; see the second sketch after this list.
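
Here is a minimal sketch of the undo pattern mentioned in the first point
above. None of these names are real haproxy functions ; it only illustrates
that nothing must be committed before every allocation has succeeded :

   #include <stdlib.h>
   #include <string.h>

   struct srv {
           char *id;
           char *cookie;
   };

   /* Returns 0 on success. On any failure, <s> is left exactly as it
    * was : nothing is committed until every allocation has succeeded.
    */
   int srv_set_id_and_cookie(struct srv *s, const char *id,
                             const char *cookie)
   {
           char *new_id = strdup(id);
           char *new_cookie;

           if (!new_id)
                   return -1;

           new_cookie = strdup(cookie);
           if (!new_cookie) {
                   free(new_id);            /* undo the first step */
                   return -1;
           }

           /* everything succeeded : commit and release old values */
           free(s->id);
           free(s->cookie);
           s->id = new_id;
           s->cookie = new_cookie;
           return 0;
   }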
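
And here is a self-contained sketch of the pool idea from the last point.
This is not haproxy's actual pool code, just the principle : objects of
one size are preallocated and recycled through a free list, so under
pressure an allocation fails cleanly (and the API call can be refused)
instead of memory slowly leaking away :

   #include <stdlib.h>

   struct pool {
           void *free_list;  /* free objects, chained through themselves */
           size_t size;      /* object size, must be >= sizeof(void *) */
   };

   static int pool_init(struct pool *p, size_t size, size_t count)
   {
           char *arena = malloc(size * count);
           size_t i;

           if (!arena)
                   return -1;
           p->size = size;
           p->free_list = NULL;
           for (i = 0; i < count; i++) {
                   void *obj = arena + i * size;
                   *(void **)obj = p->free_list;   /* push */
                   p->free_list = obj;
           }
           return 0;
   }

   static void *pool_get(struct pool *p)
   {
           void *obj = p->free_list;

           if (obj)
                   p->free_list = *(void **)obj;   /* pop */
           return obj;   /* NULL when exhausted : refuse the change */
   }

   static void pool_put(struct pool *p, void *obj)
   {
           *(void **)obj = p->free_list;           /* recycle */
           p->free_list = obj;
   }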

Another important point is that an API must be discussed with all its
adopters. At Exceliance, we discussed ours with interested customers to
take their ideas into account : not because they were customers, but
because they were future adopters. It's very possible that the way you
designed it perfectly fits your purpose but will be unusable to many other
people for a variety of reasons. Designing a usable API that can evolve
may take months of discussion, but it's probably worth it.

Best regards,
Willy

