Re: [PATCH] Re: 1.8.8 & 1.9dev, lua, xref_get_peer_and_lock hang / 100% cpu usage after restarting haproxy a few times

2018-05-26 Thread Thierry FOURNIER
On Sat, 26 May 2018 19:47:54 +0200
PiBa-NL  wrote:

> Hi Thierry,
> 
> > On 25-5-2018 at 15:40, Thierry FOURNIER wrote:
> > On Fri, 18 May 2018 22:17:00 +0200
> > PiBa-NL  wrote:
> >
> >> Hi Thierry,
> >>
> >> On 18-5-2018 at 20:00, Thierry FOURNIER wrote:
> >>> Hi Pieter,
> >>>
> >>> Could you test the attached patch ? It seems to fix the problem, but I
> >>> have some doubts about the reliability of the patch.
> >>>
> >>> Thierry
> >> The crash seems 'fixed' indeed.. but the lua script below now takes
> >> 5 seconds instead of 150ms.
> >>
> >> Regards,
> >> PiBa-NL (Pieter)
> >>
> >> con = core.tcp()
> >> con:connect_ssl("216.58.212.132",443) --google: 216.58.212.132
> >> request = [[GET / HTTP/1.0
> >>
> >> ]]
> >> con:send(request)
> >> res = con:receive("*a")
> >> con:close()
> > One bug can hide another bug :-) I caught both. Could you test?
> >
> > If the result is positive, I will also attach the backport for 1.6 and 1.7
> >
> > Thierry
> 
> Thanks, it seems both the hang and the second uncovered task-scheduling issue
> are fixed now (the Google website response is received/processed fast
> again). I've done some testing and installed the updated/patched
> version on my production box last night. At the moment it still works
> properly. I activated my lua healthchecker and mailer tasks and enabled 3
> threads again.. Let's see how it goes :), but as said, for now it seems to
> work alright.
> 
> Does the second issue you found and fixed clear up the initial 'doubts
> about the reliability' of the first one? Or did you have a particular,
> possibly problematic scenario in mind that I could try and check for?
> 
> For the moment I think it is more 'reliable / stable' with the patches
> than without. So in that regard I think they could be merged.
> 
> There seems to be an issue with tcp:settimeout() though. But I'll put that
> in a different mail thread as I'm not sure it's related, and the issues
> this thread started with are fixed.


I tried a case never encountered before. When the data sent is bigger than
a haproxy buffer, the send function is completely broken, so I am submitting
some additional patches.

BR,
Thierry



> Regards,
> 
> PiBa-NL (Pieter)
> 
From 4f8fb1ea81e9e33005f52754775b6dada035bf22 Mon Sep 17 00:00:00 2001
From: Thierry FOURNIER 
Date: Fri, 25 May 2018 14:38:57 +0200
Subject: [PATCH 1/5] BUG/MEDIUM: lua/socket: wrong scheduling for sockets

The appctx pointer handling is inherited from the pre-threads development.
The threads work introduced a new link system called xref, which maintains
a link between the socket initiator and the applet; the appctx is now
obtained once at the start of the function.

The inherited code still used the wrong method for getting the appctx.
This caused the wrong applet to be woken up, and the socket was no longer
responsive.

This behavior was hidden by another inherited error which is fixed in the
next patch.

This patch removes all the wrong appctx assignments.

This patch must be backported to 1.6, 1.7 and 1.8
---
 src/hlua.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/src/hlua.c b/src/hlua.c
index 4de5db5ac..03e961c37 100644
--- a/src/hlua.c
+++ b/src/hlua.c
@@ -1799,7 +1799,6 @@ no_peer:
 
 connection_empty:
 
-	appctx = objt_appctx(s->si[0].end);
 	if (!notification_new(&hlua->com, &appctx->ctx.hlua_cosocket.wake_on_read, hlua->task)) {
 		xref_unlock(&socket->xref, peer);
 		WILL_LJMP(luaL_error(L, "out of memory"));
@@ -1951,7 +1950,6 @@ static int hlua_socket_write_yield(struct lua_State *L,int status, lua_KContext
 	 * the request buffer if its not required.
 	 */
 	if (s->req.buf->size == 0) {
-		appctx = hlua->task->context;
 		if (!channel_alloc_buffer(&s->req, &appctx->buffer_wait))
 			goto hlua_socket_write_yield_return;
 	}
@@ -1959,7 +1957,6 @@ static int hlua_socket_write_yield(struct lua_State *L,int status, lua_KContext
 	/* Check for avalaible space. */
 	len = buffer_total_space(s->req.buf);
 	if (len <= 0) {
-		appctx = objt_appctx(s->si[0].end);
 		if (!notification_new(&hlua->com, &appctx->ctx.hlua_cosocket.wake_on_write, hlua->task)) {
 			xref_unlock(&socket->xref, peer);
 			WILL_LJMP(luaL_error(L, "out of memory"));
-- 
2.16.3

From cce5f285c78c3bacd24212ccfca45a460dc50090 Mon Sep 17 00:00:00 2001
From: Thierry FOURNIER 
Date: Fri, 25 May 2018 15:03:50 +0200
Subject: [PATCH 2/5] BUG/MAJOR: lua: Dead lock with sockets

In some cases, when we are waiting for data and the socket
timeout expires, we get a deadlock. The Lua socket code locks
the applet socket and calls for a notify. The notify
immediately executes code and tries to acquire the same lock,
so... deadlock.

stream_int_notify() can't be used because it wakes up the applet
task only if the stream has changes. The changes are forced
by Lua, but not reported on the stream.

stream_int_update_applet() can't be used because of the deadlock.

So, I unconditionally wake up the applet. This wakeup is performed
asynchronously, and will call stream_int_notify().

This patch must be backported to 1.6, 1.7 and 1.8
---
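
For reference, the equivalent hunk from the 2/2 variant of this fix, quoted
further down the thread, replaces the synchronous notify/update pair with an
asynchronous applet wakeup (variable naming differs between the trees):

 	/* Don't wait anything. */
-	stream_int_notify(&socket->s->si[0]);
-	stream_int_update_applet(&socket->s->si[0]);
+	appctx_wakeup(appctx);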
 

Re: [PATCH] Re: 1.8.8 & 1.9dev, lua, xref_get_peer_and_lock hang / 100% cpu usage after restarting haproxy a few times

2018-05-26 Thread Thierry Fournier


> On 26 May 2018, at 19:47, PiBa-NL  wrote:
> 
> Hi Thierry,
> 
> On 25-5-2018 at 15:40, Thierry FOURNIER wrote:
>> On Fri, 18 May 2018 22:17:00 +0200
>> PiBa-NL  wrote:
>> 
>>> Hi Thierry,
>>> 
>>> On 18-5-2018 at 20:00, Thierry FOURNIER wrote:
 Hi Pieter,
 
 Could you test the attached patch ? It seems to fix the problem, but I
 have some doubts about the reliability of the patch.
 
 Thierry
>>> The crash seems 'fixed' indeed.. but the lua script below now takes
>>> 5 seconds instead of 150ms.
>>> 
>>> Regards,
>>> PiBa-NL (Pieter)
>>> 
>>> con = core.tcp()
>>> con:connect_ssl("216.58.212.132",443) --google: 216.58.212.132
>>> request = [[GET / HTTP/1.0
>>> 
>>> ]]
>>> con:send(request)
>>> res = con:receive("*a")
>>> con:close()
>> One bug can hide another bug :-) I caught both. Could you test?
>> 
>> If the result is positive, I will also attach the backport for 1.6 and 1.7
>> 
>> Thierry
> 
> Thanks, it seems both the hang and the second uncovered task-scheduling issue are 
> fixed now (the Google website response is received/processed fast again). I've 
> done some testing and installed the updated/patched version on my production 
> box last night. At the moment it still works properly. I activated my lua 
> healthchecker and mailer tasks and enabled 3 threads again.. Let's see how it 
> goes :), but as said, for now it seems to work alright.
> 
> Does the second issue you found and fixed clear up the initial 'doubts about the 
> reliability' of the first one? Or did you have a particular, possibly 
> problematic scenario in mind that I could try and check for?


Yes. For the first version, I just tried something without much
confidence; for this version I read all the socket code,
so I’m confident.


> For the moment I think it is more 'reliable / stable' with the patches than 
> without. So in that regard I think they could be merged.
> 
> There seems to be an issue with tcp:settimeout() though. But I'll put that in a 
> different mail thread as I'm not sure it's related, and the issues this 
> thread started with are fixed.


I saw this problem. I tried to understand it, but without success.
The timeout applied seems to be the hardcoded default timeout. I'll
check this later.

BR,
Thierry


Re: [PATCH] Re: 1.8.8 & 1.9dev, lua, xref_get_peer_and_lock hang / 100% cpu usage after restarting haproxy a few times

2018-05-26 Thread PiBa-NL

Hi Thierry,

On 25-5-2018 at 15:40, Thierry FOURNIER wrote:

On Fri, 18 May 2018 22:17:00 +0200
PiBa-NL  wrote:


Hi Thierry,

On 18-5-2018 at 20:00, Thierry FOURNIER wrote:

Hi Pieter,

Could you test the attached patch ? It seems to fix the problem, but I
have some doubts about the reliability of the patch.

Thierry

The crash seems 'fixed' indeed.. but the lua script below now takes
5 seconds instead of 150ms.

Regards,
PiBa-NL (Pieter)

con = core.tcp()
con:connect_ssl("216.58.212.132",443) --google: 216.58.212.132
request = [[GET / HTTP/1.0

]]
con:send(request)
res = con:receive("*a")
con:close()

One bug can hide another bug :-) I caught both. Could you test?

If the result is positive, I will also attach the backport for 1.6 and 1.7

Thierry


Thanks, it seems both the hang and the second uncovered task-scheduling issue 
are fixed now (the Google website response is received/processed fast 
again). I've done some testing and installed the updated/patched 
version on my production box last night. At the moment it still works 
properly. I activated my lua healthchecker and mailer tasks and enabled 3 
threads again.. Let's see how it goes :), but as said, for now it seems to 
work alright.


Does the second issue you found and fixed clear up the initial 'doubts 
about the reliability' of the first one? Or did you have a particular, 
possibly problematic scenario in mind that I could try and check for?


For the moment I think it is more 'reliable / stable' with the patches 
than without. So in that regard I think they could be merged.


There seems to be an issue with tcp:settimeout() though. But I'll put that 
in a different mail thread as I'm not sure it's related, and the issues 
this thread started with are fixed.


Regards,

PiBa-NL (Pieter)




[PATCH] Re: 1.8.8 & 1.9dev, lua, xref_get_peer_and_lock hang / 100% cpu usage after restarting haproxy a few times

2018-05-25 Thread Thierry FOURNIER
On Fri, 18 May 2018 22:17:00 +0200
PiBa-NL  wrote:

> Hi Thierry,
> 
> On 18-5-2018 at 20:00, Thierry FOURNIER wrote:
> > Hi Pieter,
> >
> > Could you test the attached patch ? It seems to fix the problem, but I
> > have some doubts about the reliability of the patch.
> >
> > Thierry
> The crash seems 'fixed' indeed.. but the lua script below now takes 
> 5 seconds instead of 150ms.
> 
> Regards,
> PiBa-NL (Pieter)
> 
> con = core.tcp()
> con:connect_ssl("216.58.212.132",443) --google: 216.58.212.132
> request = [[GET / HTTP/1.0
> 
> ]]
> con:send(request)
> res = con:receive("*a")
> con:close()

One bug can hide another bug :-) I caught both. Could you test?

If the result is positive, I will also attach the backport for 1.6 and 1.7

Thierry
From 498b9d4e76b599e533ff31d0d3c14dd1726d00f5 Mon Sep 17 00:00:00 2001
From: Thierry FOURNIER 
Date: Fri, 25 May 2018 14:38:57 +0200
Subject: [PATCH 1/2] BUG/MEDIUM: lua/socket: wrong scheduling for sockets

The appctx pointer handling is inherited from the pre-threads development.
The threads work introduced a new link system called xref, which maintains
a link between the socket initiator and the applet; the appctx is now
obtained once at the start of the function.

The inherited code still used the wrong method for getting the appctx.
This caused the wrong applet to be woken up, and the socket was no longer
responsive.

This behavior was hidden by another inherited error which is fixed in the
next patch.

This patch removes all the wrong appctx assignments.

This patch must be backported to 1.6, 1.7 and 1.8
---
 src/hlua.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/src/hlua.c b/src/hlua.c
index a476bcc3d..f8d3836ef 100644
--- a/src/hlua.c
+++ b/src/hlua.c
@@ -1689,6 +1689,8 @@ __LJMP static int hlua_socket_receive_yield(struct lua_State *L, int status, lua
 	if (!socket->s)
 		goto connection_closed;
 
+	appctx = objt_appctx(socket->s->si[0].end);
+
 	oc = &socket->s->res;
 	if (wanted == HLSR_READ_LINE) {
 		/* Read line. */
@@ -1785,7 +1787,6 @@ connection_closed:
 
 connection_empty:
 
-	appctx = objt_appctx(socket->s->si[0].end);
 	if (!hlua_com_new(hlua, &appctx->ctx.hlua.wake_on_read))
 		WILL_LJMP(luaL_error(L, "out of memory"));
 	WILL_LJMP(hlua_yieldk(L, 0, 0, hlua_socket_receive_yield, TICK_ETERNITY, 0));
@@ -1894,6 +1895,8 @@ static int hlua_socket_write_yield(struct lua_State *L,int status, lua_KContext
 		return 1;
 	}
 
+	appctx = objt_appctx(socket->s->si[0].end);
+
 	/* Update the input buffer data. */
 	buf += sent;
 	send_len = buf_len - sent;
@@ -1906,7 +1909,6 @@ static int hlua_socket_write_yield(struct lua_State *L,int status, lua_KContext
 	 * the request buffer if its not required.
 	 */
 	if (socket->s->req.buf->size == 0) {
-		appctx = hlua->task->context;
 		if (!channel_alloc_buffer(&socket->s->req, &appctx->buffer_wait))
 			goto hlua_socket_write_yield_return;
 	}
@@ -1914,7 +1916,6 @@ static int hlua_socket_write_yield(struct lua_State *L,int status, lua_KContext
 	/* Check for avalaible space. */
 	len = buffer_total_space(socket->s->req.buf);
 	if (len <= 0) {
-		appctx = objt_appctx(socket->s->si[0].end);
 		if (!hlua_com_new(hlua, &appctx->ctx.hlua.wake_on_write))
 			WILL_LJMP(luaL_error(L, "out of memory"));
 		goto hlua_socket_write_yield_return;
-- 
2.16.3

From d61022902ff1489f9343a3f430ad1b2b9efd23c7 Mon Sep 17 00:00:00 2001
From: Thierry FOURNIER 
Date: Fri, 25 May 2018 15:03:50 +0200
Subject: [PATCH 2/2] BUG/MAJOR: lua: Dead lock with sockets

In some cases, when we are waiting for data and the socket
timeout expires, we get a deadlock. The Lua socket code locks
the applet socket and calls for a notify. The notify
immediately executes code and tries to acquire the same lock,
so... deadlock.

stream_int_notify() can't be used because it wakes up the applet
task only if the stream has changes. The changes are forced
by Lua, but not reported on the stream.

stream_int_update_applet() can't be used because of the deadlock.

So, I unconditionally wake up the applet. This wakeup is performed
asynchronously, and will call stream_int_notify().

This patch must be backported to 1.6, 1.7 and 1.8
---
 src/hlua.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/src/hlua.c b/src/hlua.c
index f8d3836ef..de5afa6c5 100644
--- a/src/hlua.c
+++ b/src/hlua.c
@@ -1759,8 +1759,7 @@ __LJMP static int hlua_socket_receive_yield(struct lua_State *L, int status, lua
 	bo_skip(oc, len + skip_at_end);
 
 	/* Don't wait anything. */
-	stream_int_notify(&socket->s->si[0]);
-	stream_int_update_applet(&socket->s->si[0]);
+	appctx_wakeup(appctx);
 
 	/* If the pattern reclaim to read all the data
 	 * in the connection, got out.
@@ -1942,8 +1941,7 @@ static int hlua_socket_write_yield(struct lua_State *L,int status, lua_KContext
 	}
 
 	/* update buffers. */
-	stream_int_notify(&socket->s->si[0]);
-	stream_int_update_applet(&socket->s->si[0]);
+	appctx_wakeup(appctx);
 
 	socket->s->req.rex = TICK_ETERNITY;
 	socket->s->res.wex = TICK_ETERNITY;
-- 
2.16.3


Re: 1.8.8 & 1.9dev, lua, xref_get_peer_and_lock hang / 100% cpu usage after restarting haproxy a few times

2018-05-18 Thread PiBa-NL

Hi Thierry,

On 18-5-2018 at 20:00, Thierry FOURNIER wrote:

Hi Pieter,

Could you test the attached patch ? It seems to fix the problem, but I
have some doubts about the reliability of the patch.

Thierry
The crash seems 'fixed' indeed.. but the lua script below now takes 
5 seconds instead of 150ms.


Regards,
PiBa-NL (Pieter)

con = core.tcp()
con:connect_ssl("216.58.212.132",443) --google: 216.58.212.132
request = [[GET / HTTP/1.0

]]
con:send(request)
res = con:receive("*a")
con:close()




Re: 1.8.8 & 1.9dev, lua, xref_get_peer_and_lock hang / 100% cpu usage after restarting haproxy a few times

2018-05-18 Thread Thierry FOURNIER
Hi Pieter,

Could you test the attached patch ? It seems to fix the problem, but I
have some doubts about the reliability of the patch.

Thierry




On Fri, 11 May 2018 20:13:25 +0200
PiBa-NL  wrote:

> Hi Thierry,
> 
> Okay found a simple reproduction with tcploop with a 6 second delay in 
> there and a short sleep before calling kqueue.
> 
> ./tcploop 81 L W N20 A R S:"response1\r\n" R P6000 S:"response2\r\n" R [ 
> F K ]
> 
>   gettimeofday(&before_poll, NULL);
> +    usleep(100);
>   status = kevent(kqueue_fd[tid], // int kq
> 
> Together with the attached config the issue is reproduced every time the 
> /myapplet url is requested.
> 
> Output as below:
> :stats.clihdr[0007:]: Accept-Language: 
> nl-NL,nl;q=0.9,en-US;q=0.8,en;q=0.7
> [info] 130/195936 (76770) : Wait for it..
> [info] 130/195937 (76770) : Wait response 2..
>    xref_get_peer_and_lock xref->peer == 1
> 
> Hope this helps to come up with a solution..
> 
> Thanks in advance,
> PiBa-NL (Pieter)
> 
> On 9-5-2018 at 19:47, PiBa-NL wrote:
> > Hi Thierry,
> >
> > On 9-5-2018 at 18:30, Thierry Fournier wrote:
> >> It seems a dead lock, but you observe a loop. 
> > Effectively it is a deadlock; it keeps looping over these few lines of 
> > code below from xref.h 
> > ..
> >  
> > The XCHG just swaps the 2 values (both are '1') and continues on, then 
> > the local==BUSY check is true, so it loops and swaps 1 and 1 again, and 
> > the cycle continues..
> >
> > Thanks for looking into it :) I'll try to get a 'simpler' reproduction 
> > with some well-placed sleep() as you suggest.
> > Regards,
> > PiBa-NL
> >
> > http://git.haproxy.org/?p=haproxy.git;a=blob;f=include/common/xref.h;h=6dfa7b62758dfaebe12d25f66aaa858dc873a060;hb=29d698040d6bb56b29c036aeba05f0d52d8ce94b
> >  
> >
> 
From ea4ad78120759351f203797ca263b8247edd88a5 Mon Sep 17 00:00:00 2001
From: Thierry FOURNIER 
Date: Fri, 18 May 2018 16:22:20 +0200
Subject: [PATCH 1/2] BUG/MEDIUM: lua: Dead-lock with socket

In some cases, ... fill later
---
 src/hlua.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/src/hlua.c b/src/hlua.c
index 8cc305138..15a5397e1 100644
--- a/src/hlua.c
+++ b/src/hlua.c
@@ -1781,7 +1781,6 @@ __LJMP static int hlua_socket_receive_yield(struct lua_State *L, int status, lua
 	co_skip(oc, len + skip_at_end);
 
 	/* Don't wait anything. */
-	stream_int_notify(&s->si[0]);
 	stream_int_update_applet(&s->si[0]);
 
 	/* If the pattern reclaim to read all the data
@@ -2004,7 +2003,6 @@ static int hlua_socket_write_yield(struct lua_State *L,int status, lua_KContext
 	}
 
 	/* update buffers. */
-	stream_int_notify(&s->si[0]);
 	stream_int_update_applet(&s->si[0]);
 
 	s->req.rex = TICK_ETERNITY;
-- 
2.16.3



Re: 1.8.8 & 1.9dev, lua, xref_get_peer_and_lock hang / 100% cpu usage after restarting haproxy a few times

2018-05-11 Thread PiBa-NL

Hi Thierry,

Okay found a simple reproduction with tcploop with a 6 second delay in 
there and a short sleep before calling kqueue.


./tcploop 81 L W N20 A R S:"response1\r\n" R P6000 S:"response2\r\n" R [ 
F K ]


 gettimeofday(&before_poll, NULL);
+    usleep(100);
 status = kevent(kqueue_fd[tid], // int kq

Together with the attached config the issue is reproduced every time the 
/myapplet url is requested.


Output as below:
:stats.clihdr[0007:]: Accept-Language: 
nl-NL,nl;q=0.9,en-US;q=0.8,en;q=0.7

[info] 130/195936 (76770) : Wait for it..
[info] 130/195937 (76770) : Wait response 2..
  xref_get_peer_and_lock xref->peer == 1

Hope this helps to come up with a solution..

Thanks in advance,
PiBa-NL (Pieter)

On 9-5-2018 at 19:47, PiBa-NL wrote:

Hi Thierry,

On 9-5-2018 at 18:30, Thierry Fournier wrote:
It seems a dead lock, but you observe a loop. 
Effectively it is a deadlock; it keeps looping over these few lines of 
code below from xref.h 
.. 
The XCHG just swaps the 2 values (both are '1') and continues on, then 
the local==BUSY check is true, so it loops and swaps 1 and 1 again, and 
the cycle continues..


Thanks for looking into it :) I'll try to get a 'simpler' reproduction 
with some well-placed sleep() as you suggest.

Regards,
PiBa-NL

http://git.haproxy.org/?p=haproxy.git;a=blob;f=include/common/xref.h;h=6dfa7b62758dfaebe12d25f66aaa858dc873a060;hb=29d698040d6bb56b29c036aeba05f0d52d8ce94b 



function myapplet(applet)

  core.Info("Wait for it..")
  core.sleep(1)

  local result = ""
con = core.tcp()
con:settimeout(1)
con:connect("127.0.0.1",81)
con:send("Test1\r\n")
r = con:receive("*l")
result = result .. tostring(r)
con:send("Test\r\n")
  core.Info("Wait response 2..")
r2 = con:receive("*l")
result = result .. tostring(r2)
  core.Info("close..")
con:close()
  core.Info("DONE")

response = "Finished"
applet:add_header("Server", "haproxy/webstats")
applet:add_header("Content-Type", "text/html")
applet:start_response()
applet:send(response)

end

core.register_service("myapplet", "http", myapplet)
global
  lua-load /root/haproxytest/hang_timeout_close.lua

defaults
mode http
timeout connect 5s
timeout client 30s
timeout server 60s
  
frontend stats
bind *:80
stats enable
stats admin if TRUE
stats refresh 1s

  acl myapplet path -m beg /myapplet
  http-request use-service lua.myapplet if myapplet

 include/common/xref.h | 4 ++++
 src/ev_kqueue.c       | 1 +
 2 files changed, 5 insertions(+)

diff --git a/include/common/xref.h b/include/common/xref.h
index 6dfa7b6..e6905a1 100644
--- a/include/common/xref.h
+++ b/include/common/xref.h
@@ -25,6 +25,10 @@ static inline void xref_create(struct xref *xref_a, struct xref *xref_b)
 
 static inline struct xref *xref_get_peer_and_lock(struct xref *xref)
 {
+   if (xref->peer == 1) {
+   printf("  xref_get_peer_and_lock xref->peer == 1 \n");
+   }
+
struct xref *local;
struct xref *remote;
 
diff --git a/src/ev_kqueue.c b/src/ev_kqueue.c
index bf7f666..732f20d 100644
--- a/src/ev_kqueue.c
+++ b/src/ev_kqueue.c
@@ -145,6 +145,7 @@ REGPRM2 static void _do_poll(struct poller *p, int exp)
 
fd = global.tune.maxpollevents;
 	gettimeofday(&before_poll, NULL);
+   usleep(100);
status = kevent(kqueue_fd[tid], // int kq
NULL,  // const struct kevent *changelist
0, // int nchanges


Re: 1.8.8 & 1.9dev, lua, xref_get_peer_and_lock hang / 100% cpu usage after restarting haproxy a few times

2018-05-09 Thread PiBa-NL

Hi Thierry,

On 9-5-2018 at 18:30, Thierry Fournier wrote:
It seems a dead lock, but you observe a loop. 
Effectively it is a deadlock; it keeps looping over these few lines of 
code below from xref.h 
.. 
The XCHG just swaps the 2 values (both are '1') and continues on, then 
the local==BUSY check is true, so it loops and swaps 1 and 1 again, and the 
cycle continues..


Thanks for looking into it :) I'll try to get a 'simpler' reproduction 
with some well-placed sleep() as you suggest.

Regards,
PiBa-NL

http://git.haproxy.org/?p=haproxy.git;a=blob;f=include/common/xref.h;h=6dfa7b62758dfaebe12d25f66aaa858dc873a060;hb=29d698040d6bb56b29c036aeba05f0d52d8ce94b

	while (1) {

		/* Get the local pointer to the peer. */
		local = HA_ATOMIC_XCHG(&xref->peer, XREF_BUSY);

		/* If the local pointer is NULL, the peer no longer exists. */
		if (local == NULL) {
			xref->peer = NULL;
			return NULL;
		}

		/* If the local pointer is BUSY, the peer try to acquire the
		 * lock. We retry the process.
		 */
		if (local == XREF_BUSY)
			continue;

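To make the spin concrete, here is a small standalone sketch (not HAProxy
code; the struct, the helper name and the bounded loop are simplifications
for the demo) that starts from the state shown in the log above, where the
applet side's peer pointer already holds the BUSY value 1:

#include <stdio.h>
#include <stdatomic.h>

#define XREF_BUSY ((struct xref *)1)

struct xref { _Atomic(struct xref *) peer; };

/* Same shape as the acquisition loop above, but bounded so the demo
 * terminates instead of pinning a CPU like the real hang. */
static struct xref *get_peer_and_lock_bounded(struct xref *xref, long max_spins)
{
	for (long i = 0; i < max_spins; i++) {
		struct xref *local = atomic_exchange(&xref->peer, XREF_BUSY);
		if (local == NULL) {            /* peer already gone */
			atomic_store(&xref->peer, (struct xref *)NULL);
			return NULL;
		}
		if (local == XREF_BUSY)         /* already locked: swap 1 for 1, retry */
			continue;
		return local;                   /* locked, 'local' points to the peer */
	}
	return XREF_BUSY;                       /* the real loop never gives up */
}

int main(void)
{
	struct xref lua_side, applet_side;
	atomic_init(&lua_side.peer, &applet_side);
	atomic_init(&applet_side.peer, &lua_side);

	/* Simulate the state from the warnings above: the Lua yield function
	 * already holds the lock, so the applet side's pointer is left at the
	 * BUSY value (printed as "peer: 1"). */
	atomic_store(&applet_side.peer, XREF_BUSY);

	/* hlua_socket_release() now tries to lock from the same call stack.
	 * Nothing can ever clear BUSY, so the exchange returns 1 forever. */
	if (get_peer_and_lock_bounded(&applet_side, 1000000) == XREF_BUSY)
		printf("still BUSY after 1000000 swaps -> 100%% CPU spin\n");
	return 0;
}

Compiled as C11, it reports that the exchange still returns BUSY after a
million swaps, which is the 100% CPU loop observed in production.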



Re: 1.8.8 & 1.9dev, lua, xref_get_peer_and_lock hang / 100% cpu usage after restarting haproxy a few times

2018-05-09 Thread Thierry Fournier


> On 9 May 2018, at 18:30, Thierry Fournier  
> wrote:
> 
> 
> 
>> On 8 May 2018, at 00:33, PiBa-NL  wrote:
>> 
>> Hi List, Thierry,
>> 
>> Actually this is not limited to restarts, and also happens with 1.9dev. It 
>> now happens while haproxy was running for a while and no restart was 
>> attempted while running/debugging in my NetBeans IDE..
>> 
>> Root cause imo is that hlua_socket_receive_yield and hlua_socket_release 
>> both try and acquire the same lock.
>> 
>> 
>> For debugging purposes I've added some code in hlua_socket_receive_yield(..) 
>> before the stream_int_notify:
>> 
>>struct channel *ic2 = si_ic(si);
>>struct channel *oc2 = si_oc(si);
>>ha_warning("hlua_socket_receive_yield calling notify peer:%9x  
>> si[0].state:%d oc2.flag:%09x ic2.flag:%09x\n", peer, s->si[0].state, 
>> oc2->flags, ic2->flags);
>>    stream_int_notify(&s->si[0]);
>> 
>> And:
>> static void hlua_socket_release(struct appctx *appctx)
>> {
>>struct xref *peer;
>>if (appctx->ctx.hlua_cosocket.xref.peer > 1)
>>ha_warning("hlua_socket_release peer: %9x %9x\n", 
>> appctx->ctx.hlua_cosocket.xref, appctx->ctx.hlua_cosocket.xref.peer->peer);
>>else
>>ha_warning("hlua_socket_release peer: %9x 0\n", 
>> appctx->ctx.hlua_cosocket.xref);
>> 
>> 
>> And also added code in xref_get_peer_and_lock(..):
>> static inline struct xref *xref_get_peer_and_lock(struct xref *xref)
>> {
>>if (xref->peer == 1) {
>>printf("  xref_get_peer_and_lock xref->peer == 1 \n");
>>}
>> 
>> 
>> This produces the logging:
>> 
>> [WARNING] 127/001127 (36579) : hlua_socket_receive_yield calling notify 
>> peer:  2355590  si[0].state:7 oc2.flag:0c000c220 ic2.flag:00084a024
>> [WARNING] 127/001127 (36579) : hlua_socket_release peer: 1 0
>>  xref_get_peer_and_lock xref->peer == 1 
>> 
>> When xref_get_peer_and_lock is called with a parameter whose xref->peer value 
>> is 1, it looks like it keeps swapping 1 and 1 until it is no longer 1, which 
>> never happens..
>> 
>> As for the oc2.flags, it contains CF_SHUTW_NOW.. I'm still not 100% sure when 
>> exactly that flag gets set, so I don't have a foolproof reproduction.. but it 
>> happens on pretty much a daily basis for me in production, and in test I can 
>> now usually trigger it after a few test runs with no actual traffic passing 
>> along within the first minute of running (healthchecks are performed on 
>> several backends, and a mail or 2 is sent by the lua code during this startup 
>> period..).. with the full production config..
>> 
>> Below the stacktrace that comes with it..
>> 
>> xref_get_peer_and_lock (xref=0x802355590) at 
>> P:\Git\haproxy\include\common\xref.h:37
>> hlua_socket_release (appctx=0x802355500) at P:\Git\haproxy\src\hlua.c:1595
>> si_applet_release (si=0x8023514c8) at 
>> P:\Git\haproxy\include\proto\stream_interface.h:233
>> stream_int_shutw_applet (si=0x8023514c8) at 
>> P:\Git\haproxy\src\stream_interface.c:1504
>> si_shutw (si=0x8023514c8) at 
>> P:\Git\haproxy\include\proto\stream_interface.h:320
>> stream_int_notify (si=0x8023514c8) at 
>> P:\Git\haproxy\src\stream_interface.c:465
>> hlua_socket_receive_yield (L=0x80223b388, status=1, ctx=0) at 
>> P:\Git\haproxy\src\hlua.c:1789
>> ?? () at null:
>> ?? () at null:
>> lua_resume () at null:
>> hlua_ctx_resume (lua=0x8022cb800, yield_allowed=1) at 
>> P:\Git\haproxy\src\hlua.c:1022
>> hlua_process_task (task=0x80222a500) at P:\Git\haproxy\src\hlua.c:5556
>> process_runnable_tasks () at P:\Git\haproxy\src\task.c:232
>> run_poll_loop () at P:\Git\haproxy\src\haproxy.c:2401
>> run_thread_poll_loop (data=0x802242080) at P:\Git\haproxy\src\haproxy.c:2463
>> main (argc=4, argv=0x7fffea80) at P:\Git\haproxy\src\haproxy.c:3053
>> 
>> I don't yet have any idea about the direction of a possible fix.. :(..
>> The issue is that the hlua_socket_release should probably happen, but it doesn't 
>> know what socket / peer it should release at that point.. it's in the local 
>> peer variable of the hlua_socket_receive_yield function.. Should it be 
>> 'unlocked' before calling stream_int_notify?
> 
> 
> This part of the code is tricky.
> 
> Just a precision:
> 
> - I write “socket” session for a classic haproxy session which is created by
>   Lua code for performing network I/O.
> 
> - I write “Lua” session for a classic HAProxy session which executes Lua,
>   and this Lua code communicates with the “socket” session.
> 
> This lock has two goals: maintain the link between the Lua session and the
> “socket” session in a coherent state, and ensure access to the “socket”
> buffer from the Lua session. When the lock is acquired, the “socket” session
> can’t be cleaned up.
> 
> I resume:
> 
> - “Lua” session is executed
> 
> - hlua_socket_receive_yield() acquires a lock
> 
> - hlua_socket_receive_yield() calls stream_int_notify()
>   NOTE: the function hlua_socket_receive_yield() can’t close the “socket” 
> session
> 
> - stream_int_notify() 

Re: 1.8.8 & 1.9dev, lua, xref_get_peer_and_lock hang / 100% cpu usage after restarting haproxy a few times

2018-05-09 Thread Thierry Fournier


> On 8 May 2018, at 00:33, PiBa-NL  wrote:
> 
> Hi List, Thierry,
> 
> Actually this is not limited to restarts, and also happens with 1.9dev. It 
> now happens while haproxy was running for a while and no restart was 
> attempted while running/debugging in my NetBeans IDE..
> 
> Root cause imo is that hlua_socket_receive_yield and hlua_socket_release both 
> try and acquire the same lock.
> 
> 
> For debugging purposes I've added some code in hlua_socket_receive_yield(..) 
> before the stream_int_notify:
> 
> struct channel *ic2 = si_ic(si);
> struct channel *oc2 = si_oc(si);
> ha_warning("hlua_socket_receive_yield calling notify peer:%9x  
> si[0].state:%d oc2.flag:%09x ic2.flag:%09x\n", peer, s->si[0].state, 
> oc2->flags, ic2->flags);
> stream_int_notify(&s->si[0]);
> 
> And:
> static void hlua_socket_release(struct appctx *appctx)
> {
> struct xref *peer;
> if (appctx->ctx.hlua_cosocket.xref.peer > 1)
> ha_warning("hlua_socket_release peer: %9x %9x\n", 
> appctx->ctx.hlua_cosocket.xref, appctx->ctx.hlua_cosocket.xref.peer->peer);
> else
> ha_warning("hlua_socket_release peer: %9x 0\n", 
> appctx->ctx.hlua_cosocket.xref);
> 
> 
> And also added code in xref_get_peer_and_lock(..):
> static inline struct xref *xref_get_peer_and_lock(struct xref *xref)
> {
> if (xref->peer == 1) {
> printf("  xref_get_peer_and_lock xref->peer == 1 \n");
> }
> 
> 
> This produces the logging:
> 
> [WARNING] 127/001127 (36579) : hlua_socket_receive_yield calling notify peer: 
>  2355590  si[0].state:7 oc2.flag:0c000c220 ic2.flag:00084a024
> [WARNING] 127/001127 (36579) : hlua_socket_release peer: 1 0
>   xref_get_peer_and_lock xref->peer == 1 
> 
> When xref_get_peer_and_lock is called with a parameter whose xref->peer value 
> is 1, it looks like it keeps swapping 1 and 1 until it is no longer 1, which 
> never happens..
> 
> As for the oc2.flags, it contains CF_SHUTW_NOW.. I'm still not 100% sure when 
> exactly that flag gets set, so I don't have a foolproof reproduction.. but it 
> happens on pretty much a daily basis for me in production, and in test I can 
> now usually trigger it after a few test runs with no actual traffic passing 
> along within the first minute of running (healthchecks are performed on 
> several backends, and a mail or 2 is sent by the lua code during this startup 
> period..).. with the full production config..
> 
> Below the stacktrace that comes with it..
> 
> xref_get_peer_and_lock (xref=0x802355590) at 
> P:\Git\haproxy\include\common\xref.h:37
> hlua_socket_release (appctx=0x802355500) at P:\Git\haproxy\src\hlua.c:1595
> si_applet_release (si=0x8023514c8) at 
> P:\Git\haproxy\include\proto\stream_interface.h:233
> stream_int_shutw_applet (si=0x8023514c8) at 
> P:\Git\haproxy\src\stream_interface.c:1504
> si_shutw (si=0x8023514c8) at 
> P:\Git\haproxy\include\proto\stream_interface.h:320
> stream_int_notify (si=0x8023514c8) at 
> P:\Git\haproxy\src\stream_interface.c:465
> hlua_socket_receive_yield (L=0x80223b388, status=1, ctx=0) at 
> P:\Git\haproxy\src\hlua.c:1789
> ?? () at null:
> ?? () at null:
> lua_resume () at null:
> hlua_ctx_resume (lua=0x8022cb800, yield_allowed=1) at 
> P:\Git\haproxy\src\hlua.c:1022
> hlua_process_task (task=0x80222a500) at P:\Git\haproxy\src\hlua.c:5556
> process_runnable_tasks () at P:\Git\haproxy\src\task.c:232
> run_poll_loop () at P:\Git\haproxy\src\haproxy.c:2401
> run_thread_poll_loop (data=0x802242080) at P:\Git\haproxy\src\haproxy.c:2463
> main (argc=4, argv=0x7fffea80) at P:\Git\haproxy\src\haproxy.c:3053
> 
> I don't yet have any idea about the direction of a possible fix.. :(..
> The issue is that the hlua_socket_release should probably happen, but it doesn't 
> know what socket / peer it should release at that point.. it's in the local 
> peer variable of the hlua_socket_receive_yield function.. Should it be 
> 'unlocked' before calling stream_int_notify?


This part of the code is tricky.

Just a precision:

 - I write “socket” session for a classic haproxy session which is created by
   Lua code for performing network I/O.

 - I write “Lua” session for a classic HAProxy session which executes Lua, and
   this Lua code communicates with the “socket” session.

This lock has two goals: maintain the link between the Lua session and the
“socket” session in a coherent state, and ensure access to the “socket” buffer
from the Lua session. When the lock is acquired, the “socket” session can’t be
cleaned up.

I resume:

 - “Lua” session is executed

 - hlua_socket_receive_yield() acquires a lock

 - hlua_socket_receive_yield() calls stream_int_notify()
   NOTE: the function hlua_socket_receive_yield() can’t close the “socket”
   session

 - stream_int_notify() is executed from the “Lua” session.

 - stream_int_notify() tries to close the “socket” session

 - stream_int_notify() tries to acquire the lock which is already acquired by the
   caller function hlua_socket_receive_yield().

It 
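
The re-entrant acquisition described above matches the gdb stack PiBa-NL
posted earlier in the thread (function names taken from that trace):

  hlua_socket_receive_yield()        <- xref_get_peer_and_lock(): lock taken
    stream_int_notify()              <- called synchronously while still locked
      si_shutw()
        stream_int_shutw_applet()
          si_applet_release()
            hlua_socket_release()    <- xref_get_peer_and_lock(): spins on the
                                        lock taken five frames up

The 2/2 patch sidesteps the synchronous path: the yield functions only call
appctx_wakeup(appctx), and stream_int_notify() runs later from the applet
task, after the Lua side has released the lock.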

Re: 1.8.8 & 1.9dev, lua, xref_get_peer_and_lock hang / 100% cpu usage after restarting haproxy a few times

2018-05-07 Thread PiBa-NL

Hi List, Thierry,

Actually this is not limited to restarts, and also happens with 1.9dev. 
It now happens while haproxy was running for a while and no restart was 
attempted while running/debugging in my NetBeans IDE..


Root cause imo is that hlua_socket_receive_yield and hlua_socket_release 
both try and acquire the same lock.



For debugging purposes I've added some code in 
hlua_socket_receive_yield(..) before the stream_int_notify:


    struct channel *ic2 = si_ic(si);
    struct channel *oc2 = si_oc(si);
    ha_warning("hlua_socket_receive_yield calling notify peer:%9x 
si[0].state:%d oc2.flag:%09x ic2.flag:%09x\n", peer, s->si[0].state, 
oc2->flags, ic2->flags);

    stream_int_notify(&s->si[0]);

And:
static void hlua_socket_release(struct appctx *appctx)
{
    struct xref *peer;
    if (appctx->ctx.hlua_cosocket.xref.peer > 1)
        ha_warning("hlua_socket_release peer: %9x %9x\n", 
appctx->ctx.hlua_cosocket.xref, appctx->ctx.hlua_cosocket.xref.peer->peer);

    else
        ha_warning("hlua_socket_release peer: %9x 0\n", 
appctx->ctx.hlua_cosocket.xref);



And also added code in xref_get_peer_and_lock(..):
static inline struct xref *xref_get_peer_and_lock(struct xref *xref)
{
    if (xref->peer == 1) {
        printf("  xref_get_peer_and_lock xref->peer == 1 \n");
    }


This produces the logging:

[WARNING] 127/001127 (36579) : hlua_socket_receive_yield calling notify 
peer:  2355590  si[0].state:7 oc2.flag:0c000c220 ic2.flag:00084a024

[WARNING] 127/001127 (36579) : hlua_socket_release peer: 1 0
  xref_get_peer_and_lock xref->peer == 1

When xref_get_peer_and_lock is called with a parameter whose xref->peer 
value is 1, it looks like it keeps swapping 1 and 1 until it is no longer 1, 
which never happens..


As for the oc2.flags, it contains CF_SHUTW_NOW.. I'm still not 100% sure 
when exactly that flag gets set, so I don't have a foolproof reproduction.. 
but it happens on pretty much a daily basis for me in production, and in 
test I can now usually trigger it after a few test runs with no actual 
traffic passing along within the first minute of running (healthchecks 
are performed on several backends, and a mail or 2 is sent by the lua 
code during this startup period..).. with the full production config..


Below the stacktrace that comes with it..

xref_get_peer_and_lock (xref=0x802355590) at 
P:\Git\haproxy\include\common\xref.h:37

hlua_socket_release (appctx=0x802355500) at P:\Git\haproxy\src\hlua.c:1595
si_applet_release (si=0x8023514c8) at 
P:\Git\haproxy\include\proto\stream_interface.h:233
stream_int_shutw_applet (si=0x8023514c8) at 
P:\Git\haproxy\src\stream_interface.c:1504
si_shutw (si=0x8023514c8) at 
P:\Git\haproxy\include\proto\stream_interface.h:320
stream_int_notify (si=0x8023514c8) at 
P:\Git\haproxy\src\stream_interface.c:465
hlua_socket_receive_yield (L=0x80223b388, status=1, ctx=0) at 
P:\Git\haproxy\src\hlua.c:1789

?? () at null:
?? () at null:
lua_resume () at null:
hlua_ctx_resume (lua=0x8022cb800, yield_allowed=1) at 
P:\Git\haproxy\src\hlua.c:1022

hlua_process_task (task=0x80222a500) at P:\Git\haproxy\src\hlua.c:5556
process_runnable_tasks () at P:\Git\haproxy\src\task.c:232
run_poll_loop () at P:\Git\haproxy\src\haproxy.c:2401
run_thread_poll_loop (data=0x802242080) at P:\Git\haproxy\src\haproxy.c:2463
main (argc=4, argv=0x7fffea80) at P:\Git\haproxy\src\haproxy.c:3053

I don't yet have any idea about the direction of a possible fix.. :(..
The issue is that the hlua_socket_release should probably happen, but it 
doesn't know what socket / peer it should release at that point.. it's in 
the local peer variable of the hlua_socket_receive_yield function.. 
Should it be 'unlocked' before calling stream_int_notify?


Does anyone dare to take a stab at creating a patch? If so, thanks in 
advance ;)


Regards,
PiBa-NL (Pieter)


On 3-5-2018 at 1:30, PiBa-NL wrote:

Hi List,

Sometimes after a few 'restarts' of haproxy 1.8.8 (using the -sf 
parameter) one of the processes seems to get into a 'hanging' state 
consuming 100% cpu..


In this configuration I'm using 'nbthread 1'; not sure if this is 
related to the corrupted task-tree from my other lua issue: 
https://www.mail-archive.com/haproxy@formilux.org/msg29801.html


Also I'm using my new smtpmailqueue and serverhealthchecker lua 
scripts (they can be found on GitHub).. these probably 'contribute' to 
triggering the condition.


Anything I can check / provide?

(can't really share the config itself a.t.m. as it's from our production 
env, but it has like 15 backends with 1 server each, and a little header 
rewriting/insertion but nothing big..)


GNU gdb (GDB) 8.0.1 [GDB v8.0.1 for FreeBSD]
Copyright (C) 2017 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 


This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show 
copying"

and "show warranty" for details.
This GDB was configured as