Re: Retry for idempotent backends
On Tue, Aug 28, 2018 at 11:11:52AM +, Edward Hibbert wrote:
> This is partly a FAQ, but hopefully it goes beyond that... into a more
> subtle FAQ, perhaps.
>
> I understand that in general it is dangerous for haproxy to retry a
> request once it has been sent to a server, for example after a timeout
> or error, because the server may not be idempotent. I get this
> objection.
>
> However there are some backends that are designed to be idempotent, or
> where the consequences of retrying an operation are less bad than the
> consequences of clients timing out. I have servers supporting an AJAX
> API over HTTP which are designed in this way, partly because HTTP
> client retries can result in duplicate POST requests anyway.
>
> It would be lovely if there was an option to retry in such situations,
> with suitable caveats.

We're already aware of this. It's not only a matter of adding an option
but a matter of important infrastructure changes. It's one of the
long-term goals of the current redesign of the HTTP layer.

If haproxy is efficient, it's because it ensures it passes data from one
end to the other as fast as possible, and by doing so it never keeps
track of an already forwarded request. See it as a router: you wouldn't
expect your router to retransmit packets that have been lost. Here it's
the same.

So L7 retries will require keeping a copy of the request being forwarded,
in the hope of being able to retry it. It will obviously be limited to
short requests which entirely fit into a buffer (thus not large POSTs).
It will also significantly increase memory usage, since the copy of the
request will have to be kept for the time it takes the server to respond.
So basically, instead of keeping the request for one microsecond, haproxy
will keep it for a few hundred milliseconds to a few seconds. This is why
this will definitely not be enabled by default!

Regards,
Willy
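For context, the mechanism Willy describes (keeping a bounded copy of the request until the server responds) is essentially what later shipped in haproxy 2.0 as the `retry-on` backend keyword. A minimal sketch of such a configuration, per the 2.0 documentation (the backend name and server addresses are hypothetical):

```
backend idempotent_api
    mode http
    # Safe only because this backend is designed to tolerate replays.
    retries 3
    retry-on conn-failure response-timeout empty-response
    server s1 192.0.2.10:8080 check
    server s2 192.0.2.11:8080 check
```

As predicted in the reply above, a request is only retryable if it fits entirely in a buffer, and nothing beyond plain connection retries happens unless `retry-on` is explicitly configured.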
Re: [PATCH] MEDIUM: reset lua transaction between http requests
Hi Tim,

On Tue, Aug 28, 2018 at 10:27:34PM +0200, Tim Düsterhus wrote:
> Willy,
>
> On 25.08.2018 at 08:13, Willy Tarreau wrote:
> > Done, thanks.
>
> I just noticed that the reg-test still carried the old name (h*),
> instead of the new one (b*), because Frederic renamed them in the
> meantime, while this patch was still pending. You should rename
> lua/h1.* to lua/b3.*.

Well, that's exactly why I prefer patches over instructions. Having to
apply various operations spread over multiple e-mails to some data is
confusing and time-consuming. Often it's much easier for the reporter to
simply update a patch than to explain all that has to be done. And
sometimes it even highlights remaining issues that are not necessarily
noticed before the patch is written.

> And looking at the file names I am wondering whether the current naming
> scheme of an incrementing number is a good one. It will probably cause
> a ton of merge conflicts (when reg-tests are regularly provided) in the
> future.

I thought exactly the same while merging it, but given that we're almost
all convinced that the reg-test directory's organization will continue to
evolve a bit until we find the best one, I was not overly concerned.

> It might make sense to either use the timestamp for the
> filenames or a date + short slug representing the test description:
>
> b_20180828_txn-get-priv-scope.vtc

That could be an option indeed, especially for bugs. Another aspect I was
thinking about is to backport the reg-test stuff once we're pretty
satisfied with the organization, so that we can use it to check whether a
version is affected by a bug and whether the backported fix properly
addresses it. We just need to remember that sometimes these tests will
have to be adapted during the backports, either because the original one
uses options not available in the older versions, or just because the bug
isn't triggered exactly the same way.

Cheers,
Willy
Re: [PATCH] MEDIUM: reset lua transaction between http requests
Willy,

On 25.08.2018 at 08:13, Willy Tarreau wrote:
> Done, thanks.

I just noticed that the reg-test still carried the old name (h*), instead
of the new one (b*), because Frederic renamed them in the meantime, while
this patch was still pending. You should rename lua/h1.* to lua/b3.*.

And looking at the file names I am wondering whether the current naming
scheme of an incrementing number is a good one. It will probably cause a
ton of merge conflicts (when reg-tests are regularly provided) in the
future. It might make sense to either use a timestamp for the filenames
or a date + short slug representing the test description:

b_20180828_txn-get-priv-scope.vtc

Best regards
Tim Düsterhus
Re: lua script, 200% cpu usage with nbthread 3 - haproxy hangs - __spin_lock - HA-Proxy version 1.9-dev1-e3faf02 2018/08/25
Hi Frederic,

On 28-8-2018 at 11:27, Frederic Lecaille wrote:
> On 08/27/2018 10:46 PM, PiBa-NL wrote:
>> Hi Frederic, Oliver,
>>
>> Thanks for your investigations :). I've made a little reg-test (files
>> attached). It's probably not 'correct' to commit as-is, but should be
>> enough to get a reproduction.. I hope.. Changing it to nbthread 1
>> makes it work every time (that I tried).
>>
>> The test actually seems to show a variety of issues.
>>
>> ## Every once in a while it takes like 7 seconds to run a test..
>> During which cpu usage is high..
>
> Do you think we can reproduce this 200% CPU usage issue after having
> disabled ssl?

With ssl 'disabled' I can run the test 500 times without a single
failure.

As for the cpu usage issue, it does not seem to reproduce 'easily' when
running inside varnishtest. But that might also be because it dumps its
core most of the time.

Using the same config that varnishtest generated, then changing the ports
to :80 (for the frontend) and :81 (for stats) and manually running
"haproxy -f /tmp/vtc.132.456/h1/cfg": after a few curl requests, curl
hangs waiting for haproxy's response, while haproxy is running at 100%
cpu.

Below are 2 backtraces, one of the 100% cpu usage and one of a core dump.
Does that help? Do you need the actual core+binary?

Regards,
PiBa-NL (Pieter)

# Using 100% cpu:

(gdb) info thread
  Id  Target Id                    Frame
* 1   LWP 101573 of process 28901  0x000801e11e3a in _kevent () from /lib/libc.so.7
  2   LWP 100816 of process 28901  0x000801e11e3a in _kevent () from /lib/libc.so.7
  3   LWP 101309 of process 28901  0x00080187a71d in ?? () from /usr/local/lib/liblua-5.3.so
(gdb) thread 3
[Switching to thread 3 (LWP 101309 of process 28901)]
#0  0x00080187a71d in ?? () from /usr/local/lib/liblua-5.3.so
(gdb) bt full
#0  0x00080187a71d in ?? () from /usr/local/lib/liblua-5.3.so
No symbol table info available.
#1  0x00080187acd7 in ?? () from /usr/local/lib/liblua-5.3.so
No symbol table info available.
#2  0x00080187b108 in ?? () from /usr/local/lib/liblua-5.3.so
No symbol table info available.
#3  0x000801873e30 in lua_gc () from /usr/local/lib/liblua-5.3.so
No symbol table info available.
#4  0x00438e45 in hlua_ctx_resume (lua=0x8024dbf80, yield_allowed=1) at src/hlua.c:1186
        ret = 0
        msg = 0x5a5306 "Hiu\360"
        trace = 0x7fffdfdfcc00 ""
#5  0x0044887a in hlua_applet_http_fct (ctx=0x8024d4a80) at src/hlua.c:6716
        si = 0x803081840
        strm = 0x803081500
        res = 0x803081570
        rule = 0x80242d6e0
        px = 0x8024c4400
        hlua = 0x8024dbf80
        blk1 = 0x7fffdfdfcca0 ""
        len1 = 34397581057
        blk2 = 0x803081578 ""
        len2 = 34410599800
        ret = 0
#6  0x005a78a7 in task_run_applet (t=0x80242db40, context=0x8024d4a80, state=16385) at src/applet.c:49
        app = 0x8024d4a80
        si = 0x803081840
#7  0x005a49a6 in process_runnable_tasks () at src/task.c:384
        t = 0x80242db40
        state = 16385
        ctx = 0x8024d4a80
        process = 0x5a77f0
        t = 0x80242db40
        max_processed = 200
#8  0x0051a6b2 in run_poll_loop () at src/haproxy.c:2386
        next = -2118609833
        exp = -2118610700
#9  0x00517672 in run_thread_poll_loop (data=0x8024843c8) at src/haproxy.c:2451
        start_lock = {lock = 0, info = {owner = 0, waiters = 0, last_location = {function = 0x0, file = 0x0, line = 0}}}
        ptif = 0x8c1980
        ptdf = 0x800f177cc
#10 0x000800f12bc5 in ?? () from /lib/libthr.so.3
No symbol table info available.
#11 0x in ?? ()
No symbol table info available.
Backtrace stopped: Cannot access memory at address 0x7fffdfdfd000

## Core dump:

gdb --core haproxy.core /usr/local/sbin/haproxy
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you
are welcome to change it and/or distribute copies of it under certain
conditions. Type "show copying" to see the conditions. There is
absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "amd64-marcel-freebsd"...
Core was generated by `haproxy -f /tmp/vtc.28884.6c5c88f3/h1/cfg'.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /lib/libcrypt.so.5...done.
Loaded symbols for /lib/libcrypt.so.5
Reading symbols from /lib/libz.so.6...done.
Loaded symbols for /lib/libz.so.6
Reading symbols from /lib/libthr.so.3...done.
Loaded symbols for /lib/libthr.so.3
Reading symbols from /usr/lib/libssl.so.8...done.
Loaded symbols for /usr/lib/libssl.so.8
Reading symbols from /lib/libcrypto.so.8...done.
Loaded symbols for /lib/libcrypto.so.8
Reading symbols from /usr/local/lib/liblua-5.3.so...done.
Loaded symbols for /usr/local/lib/liblua-5.3.so
Reading symbols from /lib/libm.so.5...done.
Loaded symbols for /lib/libm.so.5
Reading symbols from
Re: Haproxy 1.8 segfaults on misconfigured set server fqdn command
On 08/14/2018 11:27 AM, Lukas Tribus wrote:
> Hello,

Hi,

> the "set server <backend>/<server> fqdn <fqdn>" admin socket command
> requires the internal DNS resolver to be configured and enabled for
> that specific server. This is undocumented, and I will provide a doc
> fix soon.
>
> However, when the resolver is not configured, and when haproxy is
> compiled with thread support, after issuing the set server fqdn admin
> socket command, haproxy segfaults (from haproxy 1.8.0 to current 1.9
> head):

As this bug came with b418c122, I took a look at it. It had been fixed,
then came back with thread support. Reg testing file provided.

Fred.

From 995007d2edb8c296761bcf9922413e377f295b94 Mon Sep 17 00:00:00 2001
From: Frédéric Lécaille
Date: Tue, 21 Aug 2018 15:04:23 +0200
Subject: [PATCH] BUG/MINOR: server: Crash when setting FQDN via CLI.

This patch ensures that a DNS resolution may be launched before setting
a server FQDN via the CLI. In particular, it checks that resolvers was
set.

A LEVEL 4 reg testing file is provided.

Thanks to Lukas Tribus for having reported this issue.

Must be backported to 1.8.
---
 reg-tests/server/b0.vtc | 32 ++++++++++++++++++++++++++++++++
 src/server.c            |  4 ++++
 2 files changed, 36 insertions(+)
 create mode 100644 reg-tests/server/b0.vtc

diff --git a/reg-tests/server/b0.vtc b/reg-tests/server/b0.vtc
new file mode 100644
index ..a746dbea
--- /dev/null
+++ b/reg-tests/server/b0.vtc
@@ -0,0 +1,32 @@
+varnishtest "Set server FQDN via CLI crash"
+
+feature ignore_unknown_macro
+
+# Does nothing. It is there only to create the s1_* macros.
+server s1 {
+} -start
+
+haproxy h1 -conf {
+    defaults
+        mode http
+        timeout connect 1s
+        timeout client 1s
+        timeout server 1s
+
+    frontend myfrontend
+        bind "fd@${my_fe}"
+        default_backend test
+
+    backend test
+        server www1 ${s1_addr}:${s1_port}
+} -start
+
+haproxy h1 -cli {
+    send "set server test/www1 fqdn foo.fqdn"
+    expect ~ "could not update test/www1 FQDN by 'stats socket command'"
+    send "show servers state test"
+    expect ~ "test 1 www1 ${s1_addr} .* - ${s1_port}"
+} -wait

diff --git a/src/server.c b/src/server.c
index 78d5a0fc..9319d71f 100644
--- a/src/server.c
+++ b/src/server.c
@@ -3928,6 +3928,10 @@ int srv_set_fqdn(struct server *srv, const char *hostname, int dns_locked)
 	char *hostname_dn;
 	int   hostname_len, hostname_dn_len;
 
+	/* Note that the server lock is already held. */
+	if (!srv->resolvers)
+		return -1;
+
 	if (!dns_locked)
 		HA_SPIN_LOCK(DNS_LOCK, &srv->resolvers->lock);
 
 	/* run time DNS resolution was not active for this server
-- 
2.11.0
Re: lua script, 200% cpu usage with nbthread 3 - haproxy hangs - __spin_lock - HA-Proxy version 1.9-dev1-e3faf02 2018/08/25
On Tue, Aug 28, 2018 at 02:47:28PM +0200, Olivier Houchard wrote:
> Ok you're right, I have a patch for that problem, which should
> definitely be different from Pieter's problem :)
> Willy, I think it's safe to be applied, and should probably be
> backported (albeit it should be adapted, given the API differences
> with buffers/channels) to 1.8 and 1.7; I've been able to reproduce
> the problem on both.

Looks good, now applied, thanks!

Willy
Re: lua script, 200% cpu usage with nbthread 3 - haproxy hangs - __spin_lock - HA-Proxy version 1.9-dev1-e3faf02 2018/08/25
Hi,

On Mon, Aug 27, 2018 at 03:26:50PM +0200, Frederic Lecaille wrote:
[...]
> According to Pieter's traces, haproxy has registered HTTP service mode
> lua applets in HTTP mode. Your patch fixes a TCP service mode issue.
> The reg-test/lua/b1.vtc script runs both HTTP and TCP lua applets, but
> it is the TCP mode one which sometimes makes this script fail.
>
> > > > It seems one thread is stuck in lua_gc() while holding the
> > > > global LUA lock, but I don't know enough about LUA to guess
> > > > what is going on.
> > >
> > > What is suspect is that the HTTP and TCP applet functions
> > > hlua_applet_(http|tcp)_fct() are called several times even when
> > > the applet is done, or when the streams are disconnected or closed:
> > >
> > >     /* If the stream is disconnect or closed, ldo nothing. */
> > >     if (unlikely(si->state == SI_ST_DIS || si->state == SI_ST_CLO))
> > >         return;
> > >
> > > this leads to hlua_ctx_resume() being called several times from
> > > the same thread, I guess.
> >
> > But if hlua_applet_(http|tcp)_fct() just returns, who calls
> > hlua_ctx_resume() ? :)
>
> The hlua_applet_(http|tcp)_fct() functions. If you run the script
> previously mentioned, when it fails this is because
> hlua_applet_*tcp*_fct() is called infinitely.

Ok you're right, I have a patch for that problem, which should definitely
be different from Pieter's problem :)

Willy, I think it's safe to be applied, and should probably be backported
(albeit it should be adapted, given the API differences with
buffers/channels) to 1.8 and 1.7; I've been able to reproduce the problem
on both.

Regards,
Olivier

From bf62441f9d0b305e16a74dbe3341ee7933c04761 Mon Sep 17 00:00:00 2001
From: Olivier Houchard
Date: Tue, 28 Aug 2018 14:41:31 +0200
Subject: [PATCH] BUG/MEDIUM: hlua: Make sure we drain the output buffer
 when done.

In hlua_applet_tcp_fct(), drain the output buffer when the applet is done
running, every time we're called. Otherwise, there's a race condition,
and the output buffer could be filled after the applet ran, and as it is
never cleared, the stream interface will never be destroyed.

This should be backported to 1.8 and 1.7.
---
 src/hlua.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/src/hlua.c b/src/hlua.c
index edb4f68c..7bbc854d 100644
--- a/src/hlua.c
+++ b/src/hlua.c
@@ -6446,8 +6446,11 @@ static void hlua_applet_tcp_fct(struct appctx *ctx)
 	struct hlua *hlua = ctx->ctx.hlua_apptcp.hlua;
 
 	/* The applet execution is already done. */
-	if (ctx->ctx.hlua_apptcp.flags & APPLET_DONE)
+	if (ctx->ctx.hlua_apptcp.flags & APPLET_DONE) {
+		/* eat the whole request */
+		co_skip(si_oc(si), co_data(si_oc(si)));
 		return;
+	}
 
 	/* If the stream is disconnect or closed, ldo nothing. */
 	if (unlikely(si->state == SI_ST_DIS || si->state == SI_ST_CLO))
-- 
2.14.3
Retry for idempotent backends
This is partly a FAQ, but hopefully it goes beyond that... into a more
subtle FAQ, perhaps.

I understand that in general it is dangerous for haproxy to retry a
request once it has been sent to a server, for example after a timeout or
error, because the server may not be idempotent. I get this objection.

However there are some backends that are designed to be idempotent, or
where the consequences of retrying an operation are less bad than the
consequences of clients timing out. I have servers supporting an AJAX API
over HTTP which are designed in this way, partly because HTTP client
retries can result in duplicate POST requests anyway.

It would be lovely if there was an option to retry in such situations,
with suitable caveats.

Edward

(Originally posted at
https://discourse.haproxy.org/t/retry-for-idempotent-backends/2896/2 but
it was suggested I bring it here.)
Re: lua script, 200% cpu usage with nbthread 3 - haproxy hangs - __spin_lock - HA-Proxy version 1.9-dev1-e3faf02 2018/08/25
On 08/27/2018 10:46 PM, PiBa-NL wrote:
> Hi Frederic, Oliver,
>
> Thanks for your investigations :). I've made a little reg-test (files
> attached). Its probably not 'correct' to commit as-is, but should be
> enough to get a reproduction.. I hope.. changing it to nbthread 1
> makes it work every time..(that i tried)
>
> The test actually seems to show a variety of issues.
>
> ## Every once in a while it takes like 7 seconds to run a test..
> During which cpu usage is high..

Do you think we can reproduce this 200% CPU usage issue after having
disabled ssl like that:

diff --git a/reg-tests/lua/b2.lua b/reg-tests/lua/b2.lua
index 1053430f..c623d229 100644
--- a/reg-tests/lua/b2.lua
+++ b/reg-tests/lua/b2.lua
@@ -164,7 +164,7 @@ end
 core.register_service("fakeserv", "http", function(applet)
 	core.Info("APPLET START")
-	local mc = Luacurl("127.0.0.1",8443, true)
+	local mc = Luacurl("127.0.0.1",8443, false)
 	local headers = {}
 	local body = ""
 	core.Info("APPLET GET")
diff --git a/reg-tests/lua/b2.vtc b/reg-tests/lua/b2.vtc
index 1d634d56..11d4d5ae 100644
--- a/reg-tests/lua/b2.vtc
+++ b/reg-tests/lua/b2.vtc
@@ -2,6 +2,11 @@ varnishtest "Lua: txn:get_priv() scope"
 feature ignore_unknown_macro
 
 haproxy h1 -conf {
+    defaults
+        timeout connect 1s
+        timeout client 1s
+        timeout server 1s
+
     global
         nbthread 3
         lua-load ${testdir}/b2.lua
@@ -14,7 +19,7 @@ haproxy h1 -conf {
     frontend fe2
         mode http
-        bind ":8443" ssl crt ${testdir}/common.pem
+        bind ":8443" #ssl crt ${testdir}/common.pem
         stats enable
         stats uri /
Re: lua script, 200% cpu usage with nbthread 3 - haproxy hangs - __spin_lock - HA-Proxy version 1.9-dev1-e3faf02 2018/08/25
On 08/27/2018 10:46 PM, PiBa-NL wrote:
> Hi Frederic, Oliver,

Hi Pieter,

> Thanks for your investigations :). I've made a little reg-test (files
> attached). Its probably not 'correct' to commit as-is, but should be
> enough to get a reproduction.. I hope.. changing it to nbthread 1
> makes it work every time..(that i tried)

Your script is correct. Thank you a lot for this, Pieter.

> The test actually seems to show a variety of issues.
>
> ## Every once in a while it takes like 7 seconds to run a test..
> During which cpu usage is high..

Sounds like the first issue you reported. You can use varnishtest's -t
option to set a large timeout, so that you have enough time to kill
varnishtest (Ctrl+C) to prevent it from killing haproxy. Then you can
attach gdb to the haproxy process.

c0 7.6 HTTP rx timeout (fd:5 7500 ms)

> ## But most of the time, it just doesn't finish with a correct result
> (ive seen haproxy do core dumps also while testing..). There is of
> course the option that i did something wrong in the lua as well...
> Does the test itself work for you guys? (with nbthread 1)

I have not managed to make this script fail with "nbthread 1".

I have also seen coredumps with "nbthread 3", even with only one HTTP
request from the c0 client:

client c0 -connect ${h1_fe1_sock} {
    txreq -url "/"
    rxresp
    expect resp.status == 200
}

If you run varnishtest with the -l option, it leaves the temporary vtc.*
directory in place if the test failed. If you set your environment to
produce coredumps (ulimit -c unlimited) you should find coredump files in
the /tmp/vtc.*/ directories (/tmp/vtc.*/h1/ in our case).

According to gdb we have an issue in src/ssl_sock.c, so I have CC'd this
mail to Emeric:

Reading symbols from haproxy...done.
[New LWP 32432]
[New LWP 32431]
[New LWP 32428]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/home/flecaille/src/haproxy/haproxy -d -f /tmp/vtc.32410.6f80f987/h1/cfg'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x7f78f98bba56 in ASN1_get_object () from /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1
[Current thread is 1 (Thread 0x7f78f8522700 (LWP 32432))]
(gdb) bt full
#0  0x7f78f98bba56 in ASN1_get_object () from /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1
No symbol table info available.
#1  0x7f78f98c2ff8 in ?? () from /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1
No symbol table info available.
#2  0x7f78f98c41b5 in ?? () from /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1
No symbol table info available.
#3  0x7f78f98c4ead in ASN1_item_ex_d2i () from /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1
No symbol table info available.
#4  0x7f78f98c4f2b in ASN1_item_d2i () from /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1
No symbol table info available.
#5  0x7f78f9cdac98 in d2i_SSL_SESSION () from /usr/lib/x86_64-linux-gnu/libssl.so.1.1
No symbol table info available.
#6  0x55e2078be006 in ssl_sock_init (conn=0x7f78e8012220) at src/ssl_sock.c:4985
        ptr = 0xf800 <error: Cannot access memory at address 0xf800>
        sess = <optimized out>
        may_retry = <optimized out>
        conn = 0x7f78e8012220
#7  0x55e20797dfc1 in conn_xprt_init (conn=0x7f78e8012220) at include/proto/connection.h:84
        ret = 0
#8  tcp_connect_server (conn=0x7f78e8012220, data=0, delack=<optimized out>) at src/proto_tcp.c:545
        fd = 18
        srv = <optimized out>
        be = 0x55e207c567e0
        src = <optimized out>
#9  0x55e207981aba in si_connect (si=0x7f78e8017680) at include/proto/stream_interface.h:366
        ret = 0
#10 connect_server (s=s@entry=0x7f78e80173b0) at src/backend.c:1223
        cli_conn = 0x0
        srv_conn = 0x7f78e8012220
        srv_cs = <optimized out>
        old_cs = <optimized out>
        reuse = <optimized out>
        err = <optimized out>
#11 0x55e207924295 in sess_update_stream_int (s=0x7f78e80173b0) at src/stream.c:885
        conn_err = <optimized out>
        si = 0x7f78e8017680
        req = 0x7f78e80173c0
#12 process_stream (t=<optimized out>, context=0x7f78e80173b0, state=<optimized out>) at src/stream.c:2240
        s = 0x7f78e80173b0
        sess = <optimized out>
        rqf_last = <optimized out>
        rpf_last = 2147483648
        rq_prod_last = <optimized out>
        rq_cons_last = <optimized out>
        rp_cons_last = <optimized out>
        rp_prod_last = <optimized out>
        req_ana_back = <optimized out>
        req = 0x7f78e80173c0
        res = 0x7f78e8017420
        si_f = 0x7f78e8017638
        si_b = 0x7f78e8017680
#13 0x55e2079ab1f8 in process_runnable_tasks () at src/task.c:381
        t = <optimized out>
        state = <optimized out>
        ctx = <optimized out>
        process = <optimized out>
        t = <optimized out>
        max_processed = <optimized out>
#14 0x55e207959c51 in run_poll_loop () at src/haproxy.c:2386
        next = <optimized out>
        exp = <optimized out>
#15 run_thread_poll_loop (data=<optimized out>) at src/haproxy.c:2451
        ptif = <optimized out>
        ptdf = <optimized out>
        start_lock = 0
#16 0x7f78f9f27494 in start_thread (arg=0x7f78f8522700) at pthread_create.c:333
        __res = <optimized out>
        pd = 0x7f78f8522700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf