Hi William,

On Mon, Oct 21, 2019 at 11:14:20AM +0200, William Dauchy wrote:
> On a production environment with a given maxconn, the fd limit can be
> successfully set at a given time. While raising maxconn, a new max fd
> limit is calculated. If the setrlimit fails (e.g if sysctl fs.nr_open
> is lower), we silently revert to the max obtained through getrlimit,
> which can be way lower than the previous one set, with a working maxconn;
> we only deliver a warning at startup.
> In our production environment we rollout new configuration through
> progressive reload. If the reload/restart fails (i.e if the health check
> fails), the rollout is stopped. Before this patch, the new maxconn is
> applied silently whether setrlimit succeed or not; i.e you might easily
> face a case where fd limit is reverted to rlim_max making your
> production environment unstable very quickly.

Yes, I'm well aware of this impact as well :-/

> That's why we propose to explicitly fail when fd limit cannot be raised
> to avoid bad surprises when a new maxconn is set.
> We extended the failing case to all setrlimit. This seem to be more in
> correlation with other config errors where haproxy immediately fails.

To be honest, I'm quite embarrassed by such a change. I think the main
reason for the initial warning instead of error was that most people
using haproxy in unprivileged environments never knew what to use as
a ulimit-n or maxconn value, they only needed something that worked
fine for small traffic. That's also how developers use it by the way,
each time I start a config with explicit settings, I receive this warning
and it allows me to test it anyway. My concern is that this will cause a
large number of deployed system to simply stop working.

On the other hand, we do have automatic settings now (at least for the
FD) so it is easier for such users to simply drop the maxconn line and
be fine with it.

Probably that a good approach would be to have a new directive (or set
of directives) such as "strict-limits" which would indicate if we need
to warn or fail. We could then decide to let it warn by default and
change it to strict mode by default later. Leaving at least one round
of LTS with a warning indicating that this needs to be fixed or it will
fail in the next version will be OK. This basically means keeping the
default to relaxed until 2.2 and make it strict by default in 2.3.

I suspect that we should set the memory behavior separately, but at
the same time I don't recall having ever seen this one. Thus maybe we
could already make it strict by default.

Anyway I think you get the general point, warn users first that it will
break next time, then break by default. It will leave them with enough
time to fix their config or report their constraints.

Just let me know if that's a good enough plan for you.


Reply via email to