On Fri, Nov 13, 2020 at 01:30:21PM +0100, Julien Pivotto wrote:
> On 13 Nov 00:12, Jonathan Matthews wrote:
> > I think retry on 4XX, without modifying the request, is a terrible idea.
> > It's pretty much the opposite of what the HTTP spec says, and isn't
> > something haproxy should learn how to do :-)

>From an HTTP perspective I 100% agree with you Jonathan. I resisted a
long time against the horrible principle of L7 retries due to this. The
problem is that nowadays web applications have become the ugliest thing
that the IT industry has ever produced. Many of them do not respect the
slightest standard, deployment processes do not exist anymore, users
expect docker kill/docker run as the normal way to upgrade, and usually
the application will bind long before even being ready. The worst here
is that often owners of such applications will blame the last added
component for any breakage (usually the LB is the last added one because
in addition to being ugly, they're terribly slow and cannot stand a load
beyond unit test, thus they need to be highly scalable from day one).

And those working in hosting environments have to deal with this and have
the choice of telling their customers "go away and find someone else to
host your thing that makes me throw up" or "OK let's see how I can work
around all your breakage". Given that many already use haproxy as a
Swiss Army Knife to work around application breakage it's natural that
such requests arrive here, and that's also how we ended up with so many
options :-/

If we didn't have support for L7 retries on any 4xx I would indeed have
rejected this demand, but here it's only a matter of adding support for
two extra status codes. I don't feel easy with these enabled in the "all"
set, but not doing it would be wrong. I think that instead we should emit
a warning when "all" is used saying something along "blindly retrying on
any event is always wrong and I suspect you did it by laziness, please use
option I-really-want-to-do-something-stupid" to shut this warning. At the
very least the doc should not remind the list for "all" and just say "all
retryable events, this will always lead to problems, never use this in
production".

The real issue I'm having with supporting dangerous setups is that most
new users who discover a component like haproxy early in the architecture
steps proceed the same way: copy-paste some examples, tweak them, spot a
few corner cases, adjust some in the application, look for support of
specific tricks in the doc to fill the remaining gaps, and in the end the
application can rely on some non-standard features or dangerous hacks
without the developers being aware of this. And when they start to ask for
help, it's too late, the application is already deployed standing on a
broken leg. This is why for me it's important to warn against misuse in
the doc itself. Those relying on hacks must be aware of this and of the
consequences. Regarding this point, I think the doc is pretty clear
already, so one can think "after all, why not".

> > I know it already knows how to do it on 404 (& 408) which I can see a
> > /slight/ rationale for, in a bulk-file-hosting,
> > round-robin-until-a-server-has-a-file situation. That's still, IMHO, the
> > wrong place for this to be implemented - it should be in-app, not in-proxy.
> > I genuinely don't think we should expand the set of 4XX responses that can
> > be automatically retried!
> > 
> > J
> 
> It is not the first time I hit arbitrary code selection limitation by
> HAProxy. I think that it should be up to the user/admin to decide what
> they can do, and that HAProxy should empower them to do so.

I think we should give the users the ability to fix bad situations they've
fallen into, but also make sure they don't accidentely shoot themselves
in the foot. If the doc said on a few options "if you use this, you're on
your own, do not even hope to get support from anyone", this can be a good
incentive for them to plan a better long-term solution when time permits.

> I am not willing to expose all the details why I exactly need this.
> 
> I am not up for a long debate about the patch however. It is open source
> software, I will carry the patch in my setup as long as I need it. If
> the broader community does not want the patch, so be it.

According to the points above I'm fine with getting the patch merged even
if I agree with Jonathan that 401/403 should really not be retried under
normal circumstances. But I can imagine some broken situations where it
may temporarily help like if the HTTP port is bound before the connection
to the auth server is established for example. However, Julien, as mentioned
by Christopher, please adjust the commit message so that we can merge it.

Thanks!
Willy

Reply via email to