Re: High CPU Usage followed by segfault error

2018-10-20 Thread Soji Antony
Hi

FYI, the following is the backtrace for the segfault which we are seeing in
kern.log.

Oct 18 10:11:30  kernel: [841364.001036] haproxy[30696]: segfault at 8 ip
5567eaf6aac2 sp 7ffdd70447b0 error 6 in haproxy[5567eae75000+172000]

# apport-retrace -g _usr_sbin_haproxy.0.crash
GNU gdb (Ubuntu 7.7.1-0ubuntu5~14.04.3) 7.7.1
Copyright (C) 2014 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
.
Find the GDB manual and other documentation resources online at:
.
For help, type "help".
Type "apropos word" to search for commands related to "word".
Reading symbols from /usr/sbin/haproxy...Reading symbols from
/usr/lib/debug/.build-id/56/c5ffb3112d35c68a487caa1f4b788953891ade.debug...done.
done.
[New LWP 30696]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -p
/var/run/haproxy.pid -sf 30646'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x5567eaf6aac2 in do_unbind_listener
(listener=listener@entry=0x5567ebd82a00,
do_close=do_close@entry=1) at src/listener.c:319
319 src/listener.c: No such file or directory.
(gdb) list
314 in src/listener.c
(gdb) backtrace
#0  0x5567eaf6aac2 in do_unbind_listener
(listener=listener@entry=0x5567ebd82a00,
do_close=do_close@entry=1) at src/listener.c:319
#1  0x5567eaf6b252 in unbind_listener
(listener=listener@entry=0x5567ebd82a00)
at src/listener.c:342
#2  0x5567eaf6b308 in unbind_all_listeners (proto=0x5567eb1fbcc0
) at src/listener.c:365
#3  0x5567eaf9bf0e in protocol_unbind_all () at src/protocol.c:76
#4  0x5567eaf397d6 in deinit () at src/haproxy.c:2289
#5  0x5567eaea1e1f in main (argc=, argv=)
at src/haproxy.c:3092

HAProxy 1.8.14 is installed from the vbernat Ubuntu PPA:

# cat vbernat-haproxy.list
deb http://ppa.launchpad.net/vbernat/haproxy-1.8/ubuntu trusty main
deb-src http://ppa.launchpad.net/vbernat/haproxy-1.8/ubuntu trusty main

Thanks

On Wed, Oct 17, 2018 at 9:26 AM Soji Antony  wrote:

> Hi Willy / Olivier,
>
> Thank you very much for the patch & detailed explanation. I will apply
> this patch on our servers.
>
> > Unfortunately, as is often the case with gdb, that's less than useful :/
> > If you have that available, you may install the haproxy-dbg package, but
> > I'm not convinced it will yield better results.
>
> PFA gdb.txt file which has 'thread apply all bt' & 'info threads' command
> output after installing haproxy-dbg package if that helps.
>
> > Can you share your config, obscuring any confidential information, IP
> > addresses etc ?
>
> PFA haproxy.cfg file, I have added a comment [removed] wherever I have
> truncated lines.
>
> > You mentioned you were getting a segfault; do you know how to
> > reproduce it?
>
> Not sure how to reproduce it. I can see these segfault error messages in
> kernel logs on random servers. I can try enabling a crash dump on one of the
> servers and share the details.
>
> Oct  9 16:16:35  kernel: [85669.521234] haproxy[59075]: segfault at
> 7fda1fb0fc60 ip 55c7273b643b sp 7fd8c2ffaab0 error 4 in
> haproxy[55c72734e000+172000]
> Oct 10 09:48:43  [148797.364018] haproxy[60048]: segfault at 8 ip
> 556ba5c7eac2 sp 7ffc5ef9e730 error 6 in haproxy[556ba5b89000+172000]
> Oct 11 14:30:56  kernel: [252130.055746] haproxy[4538]: segfault at
> 7fe088e87350 ip 5637ab43fea7 sp 7fe0857e8c20 error 4 in
> haproxy[5637ab41+172000]
> Oct 11 16:47:03 kernel: [260297.82] haproxy[74455]: segfault at
> 7f07d0de7290 ip 5574f96e1ea7 sp 7f07ce9c6c20 error 4 in
> haproxy[5574f96b2000+172000]
> Oct 11 22:06:19 : [279453.364729] haproxy[103724]: segfault at
> 7f7e492535d0 ip 55c8b4f1dea7 sp 7f7e46d93c20 error 4 in
> haproxy[55c8b4eee000+172000]
> Oct 13 04:31:14 : [388948.155673] haproxy[92338]: segfault at 8 ip
> 5583be079ac2 sp 7ffc6cb34e60 error 6 in haproxy[5583bdf84000+172000]
> Oct 15 15:17:04  kernel: [600498.581053] haproxy[63374]: segfault at 8 ip
> 55dd2e7d1ac2 sp 7ffed747e1d0 error 6 in haproxy[55dd2e6dc000+172000]
>
> > You also mentioned reloads are frequent; can you tell if the CPU spike
> > happens immediately after a reload?
>
> It is very difficult to say as reloads happen quite often. Attaching the
> graph for your reference. I can see that the last reload happened at 16.56.19 &
> CPU usage started spiking at 16.57.30. But maybe due to the high CPU
> usage the script which we use to send this reload count to graphite might
> 

Re: High CPU Usage followed by segfault error

2018-10-16 Thread Olivier Houchard
On Tue, Oct 16, 2018 at 05:02:30PM +0200, Willy Tarreau wrote:
> On Tue, Oct 16, 2018 at 04:11:20PM +0200, Willy Tarreau wrote:
> > Could you please apply the attached patch ? I'm going to merge it into 1.9
> > and we'll backport it to 1.8 later.
> 
> And please add the attached one as well, which is specific to 1.8. I
> suspect that different versions of compiler could emit inappropriate
> code due to the threads_want_sync variable not being marked volatile.
> 
> In your case the issue would manifest itself if you're having heavy
> server queueing ("maxconn" on the server lines) or if you're seeing
> a lot of "server up/down" events.
> 
> Thanks,
> Willy


Nice catch !

This one is a good candidate.

Regards,

Olivier



Re: High CPU Usage followed by segfault error

2018-10-16 Thread Willy Tarreau
On Tue, Oct 16, 2018 at 04:11:20PM +0200, Willy Tarreau wrote:
> Could you please apply the attached patch ? I'm going to merge it into 1.9
> and we'll backport it to 1.8 later.

And please add the attached one as well, which is specific to 1.8. I
suspect that different versions of compiler could emit inappropriate
code due to the threads_want_sync variable not being marked volatile.

In your case the issue would manifest itself if you're having heavy
server queueing ("maxconn" on the server lines) or if you're seeing
a lot of "server up/down" events.

Thanks,
Willy
>From a655363b34a1f09625bb3f1550dc8b2c9b19b045 Mon Sep 17 00:00:00 2001
From: Willy Tarreau 
Date: Tue, 16 Oct 2018 16:57:40 +0200
Subject: BUG/MEDIUM: threads: make sure threads_want_sync is marked volatile

The threads_want_sync variable is not volatile, which allows the compiler
to cache old copies of it for long parts of code and possibly optimize
some tests away. This could result in deadlocks when using heavy queue
activity or health check state changes.

There is no upstream commit for this fix because the sync point was
completely removed from 1.9. This fix is exclusively for 1.8.

Signed-off-by: Willy Tarreau 
---
 src/hathreads.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/hathreads.c b/src/hathreads.c
index 9dba4356e..a96cc7b8f 100644
--- a/src/hathreads.c
+++ b/src/hathreads.c
@@ -29,7 +29,7 @@ void thread_sync_io_handler(int fd)
 
 static HA_SPINLOCK_T sync_lock;
 static int   threads_sync_pipe[2] = {-1, -1};
-static unsigned long threads_want_sync = 0;
+volatile static unsigned long threads_want_sync = 0;
 volatile unsigned long threads_want_rdv_mask = 0;
 volatile unsigned long threads_harmless_mask = 0;
 volatile unsigned long all_threads_mask  = 1; // nbthread 1 assumed by default
-- 
2.14.4
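To illustrate the class of bug this patch guards against: without `volatile` (or atomics), the compiler may keep a copy of the flag in a register across a loop and never observe another thread's update. Below is a minimal sketch of the same "threads want sync" bitmask pattern, written with C11 atomics instead of `volatile`; the names are hypothetical, and this is not haproxy's actual code:

```c
#include <stdatomic.h>

/* Hypothetical miniature of a "threads want sync" bitmask: each thread
 * sets its own bit to request a sync point.  Declaring the mask _Atomic
 * forces a real load on every test, so one thread's update is seen by
 * the others instead of a stale register copy being reused. */
static _Atomic unsigned long want_sync_mask = 0;

/* A thread requests a sync by setting its bit. */
void request_sync(unsigned long tid_bit)
{
    atomic_fetch_or(&want_sync_mask, tid_bit);
}

/* Checked in the polling loop: does anyone want a sync? */
int sync_requested(void)
{
    return atomic_load(&want_sync_mask) != 0;
}

/* Once the sync is done, the requesting thread clears its bit. */
void clear_sync(unsigned long tid_bit)
{
    atomic_fetch_and(&want_sync_mask, ~tid_bit);
}
```

With a plain, non-volatile `unsigned long`, an optimizing compiler is allowed to hoist the load out of a loop that polls `sync_requested()`, which is exactly the deadlock scenario the commit message describes.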



Re: High CPU Usage followed by segfault error

2018-10-16 Thread Willy Tarreau
Hi,

On Tue, Oct 16, 2018 at 09:28:46AM +0530, Soji Antony wrote:
> FYI, the initial version which we were using before upgrading to 1.8.14
> was 1.8.13.
> By mistake I wrote it as 1.8.3 in my first email.

No problem, thanks for the clarification. After re-reading the code, I found a
bug, which is extremely difficult to trigger and which could induce the
situation you're running into. It requires a race between two "show fd"
on the CLI, where the second must start exactly when the first one stops.
I couldn't reproduce it since the window is too narrow, but theoretically it
exists, and the fact that you don't see it often could confirm this.

However since you're facing the same issue with 1.8.13 which uses the older
synchronization point, I still have some doubts about the root cause. But
we know the last one could be a bit tricky so we can't rule out a slightly
different issue giving overall the same visible effects.

Could you please apply the attached patch ? I'm going to merge it into 1.9
and we'll backport it to 1.8 later.

By the way, any reason you're running with SCHED_RR ? It might make things
worse during reloads by letting some threads spin on their own spinlocks
without offering a chance to the same thread of the other process to complete
its work.

Regards,
Willy
diff --git a/src/hathreads.c b/src/hathreads.c
index 9dba4356e..0a7c12f7a 100644
--- a/src/hathreads.c
+++ b/src/hathreads.c
@@ -221,12 +221,8 @@ void thread_isolate()
  */
 void thread_release()
 {
-   while (1) {
-   HA_ATOMIC_AND(&threads_want_rdv_mask, ~tid_bit);
-   if (!(threads_want_rdv_mask & all_threads_mask))
-   break;
-   thread_harmless_till_end();
-   }
+   HA_ATOMIC_AND(&threads_want_rdv_mask, ~tid_bit);
+   thread_harmless_end();
 }
 
 __attribute__((constructor))


Re: High CPU Usage followed by segfault error

2018-10-16 Thread Olivier Houchard
Hi Soji,

On Mon, Oct 15, 2018 at 11:10:09PM +0530, Soji Antony wrote:
> Hi Olivier,
> 
> Many thanks for your reply.
> Please find the gdb output given below.
> 
> # gdb /usr/sbin/haproxy core.dump3.13871
> GNU gdb (Ubuntu 7.7.1-0ubuntu5~14.04.3) 7.7.1
> Copyright (C) 2014 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later  >
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
> and "show warranty" for details.
> This GDB was configured as "x86_64-linux-gnu".
> Type "show configuration" for configuration details.
> For bug reporting instructions, please see:
> .
> Find the GDB manual and other documentation resources online at:
> .
> For help, type "help".
> Type "apropos word" to search for commands related to "word"...
> Reading symbols from /usr/sbin/haproxy...(no debugging symbols
> found)...done.
> [New LWP 13872]
> [New LWP 13873]
> [New LWP 13888]
> [New LWP 13889]
> [New LWP 13890]
> [New LWP 13892]
> [New LWP 13893]
> [New LWP 13894]
> [New LWP 13895]
> [New LWP 13871]
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
> Core was generated by `/usr/sbin/haproxy'.
> #0  0x55b79086de43 in _start ()
> (gdb) thread apply all bt
> 
> Thread 10 (Thread 0x7f88ee327980 (LWP 13871)):
> #0  0x55b79086dee1 in _start ()
> 
> Thread 9 (Thread 0x7f88d37fe700 (LWP 13895)):
> #0  0x55b79086de46 in _start ()
> 
> Thread 8 (Thread 0x7f88cb7fe700 (LWP 13894)):
> #0  0x55b79086de46 in _start ()
> 
> Thread 7 (Thread 0x7f88d3fff700 (LWP 13893)):
> #0  0x55b79086de43 in _start ()
> 
> Thread 6 (Thread 0x7f88e8c25700 (LWP 13892)):
> #0  0x55b79086de43 in _start ()
> 
> Thread 5 (Thread 0x7f88e94c5700 (LWP 13890)):
> #0  0x55b79086de46 in _start ()
> 
> Thread 4 (Thread 0x7f88e9cc6700 (LWP 13889)):
> #0  0x55b79086de46 in _start ()
> 
> Thread 3 (Thread 0x7f88ea4c7700 (LWP 13888)):
> #0  0x55b79086de43 in _start ()
> 
> Thread 2 (Thread 0x7f88eacc8700 (LWP 13873)):
> #0  0x55b79086de43 in _start ()
> 
> Thread 1 (Thread 0x7f88eb4c9700 (LWP 13872)):
> #0  0x55b79086de43 in _start ()
> (gdb) info threads
>   Id   Target Id Frame
>   10   Thread 0x7f88ee327980 (LWP 13871) 0x55b79086dee1 in _start ()
>   9Thread 0x7f88d37fe700 (LWP 13895) 0x55b79086de46 in _start ()
>   8Thread 0x7f88cb7fe700 (LWP 13894) 0x55b79086de46 in _start ()
>   7Thread 0x7f88d3fff700 (LWP 13893) 0x55b79086de43 in _start ()
>   6Thread 0x7f88e8c25700 (LWP 13892) 0x55b79086de43 in _start ()
>   5Thread 0x7f88e94c5700 (LWP 13890) 0x55b79086de46 in _start ()
>   4Thread 0x7f88e9cc6700 (LWP 13889) 0x55b79086de46 in _start ()
>   3Thread 0x7f88ea4c7700 (LWP 13888) 0x55b79086de43 in _start ()
>   2Thread 0x7f88eacc8700 (LWP 13873) 0x55b79086de43 in _start ()
> * 1Thread 0x7f88eb4c9700 (LWP 13872) 0x55b79086de43 in _start ()
> (gdb)
> 

Unfortunately, as is often the case with gdb, that's less than useful :/
If you have that available, you may install the haproxy-dbg package, but
I'm not convinced it will yield better results.
Can you share your config, obscuring any confidential information, IP
addresses, etc.?
Do you do anything on the stats socket?
You mentioned you were getting a segfault; do you know how to reproduce it?
You also mentioned reloads are frequent; can you tell if the CPU spike happens
immediately after a reload?

Thanks !

Olivier



Re: High CPU Usage followed by segfault error

2018-10-15 Thread Soji Antony
FYI, the initial version which we were using before upgrading to 1.8.14
was 1.8.13.
By mistake I wrote it as 1.8.3 in my first email.

Thanks

On Mon, Oct 15, 2018 at 11:10 PM Soji Antony  wrote:

> Hi Olivier,
>
> Many thanks for your reply.
> Please find the gdb output given below.
>
> # gdb /usr/sbin/haproxy core.dump3.13871
> GNU gdb (Ubuntu 7.7.1-0ubuntu5~14.04.3) 7.7.1
> Copyright (C) 2014 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later <
> http://gnu.org/licenses/gpl.html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
> and "show warranty" for details.
> This GDB was configured as "x86_64-linux-gnu".
> Type "show configuration" for configuration details.
> For bug reporting instructions, please see:
> .
> Find the GDB manual and other documentation resources online at:
> .
> For help, type "help".
> Type "apropos word" to search for commands related to "word"...
> Reading symbols from /usr/sbin/haproxy...(no debugging symbols
> found)...done.
> [New LWP 13872]
> [New LWP 13873]
> [New LWP 13888]
> [New LWP 13889]
> [New LWP 13890]
> [New LWP 13892]
> [New LWP 13893]
> [New LWP 13894]
> [New LWP 13895]
> [New LWP 13871]
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
> Core was generated by `/usr/sbin/haproxy'.
> #0  0x55b79086de43 in _start ()
> (gdb) thread apply all bt
>
> Thread 10 (Thread 0x7f88ee327980 (LWP 13871)):
> #0  0x55b79086dee1 in _start ()
>
> Thread 9 (Thread 0x7f88d37fe700 (LWP 13895)):
> #0  0x55b79086de46 in _start ()
>
> Thread 8 (Thread 0x7f88cb7fe700 (LWP 13894)):
> #0  0x55b79086de46 in _start ()
>
> Thread 7 (Thread 0x7f88d3fff700 (LWP 13893)):
> #0  0x55b79086de43 in _start ()
>
> Thread 6 (Thread 0x7f88e8c25700 (LWP 13892)):
> #0  0x55b79086de43 in _start ()
>
> Thread 5 (Thread 0x7f88e94c5700 (LWP 13890)):
> #0  0x55b79086de46 in _start ()
>
> Thread 4 (Thread 0x7f88e9cc6700 (LWP 13889)):
> #0  0x55b79086de46 in _start ()
>
> Thread 3 (Thread 0x7f88ea4c7700 (LWP 13888)):
> #0  0x55b79086de43 in _start ()
>
> Thread 2 (Thread 0x7f88eacc8700 (LWP 13873)):
> #0  0x55b79086de43 in _start ()
>
> Thread 1 (Thread 0x7f88eb4c9700 (LWP 13872)):
> #0  0x55b79086de43 in _start ()
> (gdb) info threads
>   Id   Target Id Frame
>   10   Thread 0x7f88ee327980 (LWP 13871) 0x55b79086dee1 in _start ()
>   9Thread 0x7f88d37fe700 (LWP 13895) 0x55b79086de46 in _start ()
>   8Thread 0x7f88cb7fe700 (LWP 13894) 0x55b79086de46 in _start ()
>   7Thread 0x7f88d3fff700 (LWP 13893) 0x55b79086de43 in _start ()
>   6Thread 0x7f88e8c25700 (LWP 13892) 0x55b79086de43 in _start ()
>   5Thread 0x7f88e94c5700 (LWP 13890) 0x55b79086de46 in _start ()
>   4Thread 0x7f88e9cc6700 (LWP 13889) 0x55b79086de46 in _start ()
>   3Thread 0x7f88ea4c7700 (LWP 13888) 0x55b79086de43 in _start ()
>   2Thread 0x7f88eacc8700 (LWP 13873) 0x55b79086de43 in _start ()
> * 1Thread 0x7f88eb4c9700 (LWP 13872) 0x55b79086de43 in _start ()
> (gdb)
>
> On Mon, Oct 15, 2018 at 4:58 PM Olivier Houchard 
> wrote:
>
>> Hi,
>>
>> On Sat, Oct 13, 2018 at 01:34:53PM +0530, Soji Antony wrote:
>> > Really appreciate it if someone can help with this issue. This is
>> > happening quite frequently on our servers. I have taken a coredump when
>> > this happened last time using the 'gcore' command. However, I am unable
>> > to share it as it might have SSL-related information. Is there any way I
>> > can remove confidential information from this coredump file before
>> > sharing it? Unfortunately we are not able to reproduce this issue in a
>> > test environment. Also, CPU usage was 100% for all the cores which
>> > haproxy was using for around 4 hrs, though we had taken it out of
>> > rotation from DNS & it was not serving any traffic. For me, it looks like
>> > something internal to haproxy is causing an infinite loop, causing heavy
>> > CPU usage.
>> >
>> > Thanks
>> >
>>
>> Sorry for the late answer.
>> Without sending us the core, if you could at least give us the gdb output
>> of "thread apply all bt" so that we know where each thread is spinning,
>> that may be very useful, and shouldn't leak any confidential information.
>>
>> Thanks !
>>
>> Olivier
>>
>> > [image: Screen Shot 2018-10-12 at 8.13.02 PM.png]
>> >
>> > On Fri, Oct 12, 2018 at 12:01 PM Soji Antony 
>> wrote:
>> >
>> > > Hi Oliver,
>> > >
>> > > Thanks for the suggestion. We have upgraded haproxy to 1.8.14 but
>> seeing
>> > > the same CPU issue again.
>> > > I have found that the segmentation fault which we were seeing earlier
>> is
>> > > not related to the CPU spike as it is happening at different time.
>> Recently
>> > > we had the same issue with one of our haproxy 

Re: High CPU Usage followed by segfault error

2018-10-15 Thread Soji Antony
Hi Olivier,

Many thanks for your reply.
Please find the gdb output given below.

# gdb /usr/sbin/haproxy core.dump3.13871
GNU gdb (Ubuntu 7.7.1-0ubuntu5~14.04.3) 7.7.1
Copyright (C) 2014 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
.
Find the GDB manual and other documentation resources online at:
.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/sbin/haproxy...(no debugging symbols
found)...done.
[New LWP 13872]
[New LWP 13873]
[New LWP 13888]
[New LWP 13889]
[New LWP 13890]
[New LWP 13892]
[New LWP 13893]
[New LWP 13894]
[New LWP 13895]
[New LWP 13871]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/sbin/haproxy'.
#0  0x55b79086de43 in _start ()
(gdb) thread apply all bt

Thread 10 (Thread 0x7f88ee327980 (LWP 13871)):
#0  0x55b79086dee1 in _start ()

Thread 9 (Thread 0x7f88d37fe700 (LWP 13895)):
#0  0x55b79086de46 in _start ()

Thread 8 (Thread 0x7f88cb7fe700 (LWP 13894)):
#0  0x55b79086de46 in _start ()

Thread 7 (Thread 0x7f88d3fff700 (LWP 13893)):
#0  0x55b79086de43 in _start ()

Thread 6 (Thread 0x7f88e8c25700 (LWP 13892)):
#0  0x55b79086de43 in _start ()

Thread 5 (Thread 0x7f88e94c5700 (LWP 13890)):
#0  0x55b79086de46 in _start ()

Thread 4 (Thread 0x7f88e9cc6700 (LWP 13889)):
#0  0x55b79086de46 in _start ()

Thread 3 (Thread 0x7f88ea4c7700 (LWP 13888)):
#0  0x55b79086de43 in _start ()

Thread 2 (Thread 0x7f88eacc8700 (LWP 13873)):
#0  0x55b79086de43 in _start ()

Thread 1 (Thread 0x7f88eb4c9700 (LWP 13872)):
#0  0x55b79086de43 in _start ()
(gdb) info threads
  Id   Target Id Frame
  10   Thread 0x7f88ee327980 (LWP 13871) 0x55b79086dee1 in _start ()
  9Thread 0x7f88d37fe700 (LWP 13895) 0x55b79086de46 in _start ()
  8Thread 0x7f88cb7fe700 (LWP 13894) 0x55b79086de46 in _start ()
  7Thread 0x7f88d3fff700 (LWP 13893) 0x55b79086de43 in _start ()
  6Thread 0x7f88e8c25700 (LWP 13892) 0x55b79086de43 in _start ()
  5Thread 0x7f88e94c5700 (LWP 13890) 0x55b79086de46 in _start ()
  4Thread 0x7f88e9cc6700 (LWP 13889) 0x55b79086de46 in _start ()
  3Thread 0x7f88ea4c7700 (LWP 13888) 0x55b79086de43 in _start ()
  2Thread 0x7f88eacc8700 (LWP 13873) 0x55b79086de43 in _start ()
* 1Thread 0x7f88eb4c9700 (LWP 13872) 0x55b79086de43 in _start ()
(gdb)

On Mon, Oct 15, 2018 at 4:58 PM Olivier Houchard 
wrote:

> Hi,
>
> On Sat, Oct 13, 2018 at 01:34:53PM +0530, Soji Antony wrote:
> > Really appreciate it if someone can help with this issue. This is
> > happening quite frequently on our servers. I have taken a coredump when
> > this happened last time using the 'gcore' command. However, I am unable to
> > share it as it might have SSL-related information. Is there any way I can
> > remove confidential information from this coredump file before sharing it?
> > Unfortunately we are not able to reproduce this issue in a test
> > environment. Also, CPU usage was 100% for all the cores which haproxy was
> > using for around 4 hrs, though we had taken it out of rotation from DNS &
> > it was not serving any traffic. For me, it looks like something internal
> > to haproxy is causing an infinite loop, causing heavy CPU usage.
> >
> > Thanks
> >
>
> Sorry for the late answer.
> Without sending us the core, if you could at least give us the gdb output
> of "thread apply all bt" so that we know where each thread is spinning,
> that may be very useful, and shouldn't leak any confidential information.
>
> Thanks !
>
> Olivier
>
> > [image: Screen Shot 2018-10-12 at 8.13.02 PM.png]
> >
> > On Fri, Oct 12, 2018 at 12:01 PM Soji Antony 
> wrote:
> >
> > > Hi Oliver,
> > >
> > > Thanks for the suggestion. We have upgraded haproxy to 1.8.14 but
> seeing
> > > the same CPU issue again.
> > > I have found that the segmentation fault which we were seeing earlier
> is
> > > not related to the CPU spike as it is happening at different time.
> Recently
> > > we had the same issue with one of our haproxy servers and found the
> > > following in strace o/p:
> > >
> > > # haproxy -vv
> > >
> > > HA-Proxy version 1.8.14-1ppa1~trusty 2018/09/23
> > > Copyright 2000-2018 Willy Tarreau 
> > >
> > > Build options :
> > >   TARGET  = linux2628
> > >   CPU = generic
> > >   CC  = gcc
> > >   CFLAGS  = -g -O2 -fPIE -fstack-protector --param=ssp-buffer-size=4
> > > -Wformat -Werror=format-security 

Re: High CPU Usage followed by segfault error

2018-10-15 Thread Olivier Houchard
Hi,

On Sat, Oct 13, 2018 at 01:34:53PM +0530, Soji Antony wrote:
> Really appreciate it if someone can help with this issue. This is happening
> quite frequently on our servers. I have taken a coredump when this happened
> last time using the 'gcore' command. However, I am unable to share it as it
> might have SSL-related information. Is there any way I can remove
> confidential information from this coredump file before sharing it?
> Unfortunately we are not able to reproduce this issue in a test environment.
> Also, CPU usage was 100% for all the cores which haproxy was using for
> around 4 hrs, though we had taken it out of rotation from DNS & it was not
> serving any traffic. For me, it looks like something internal to haproxy is
> causing an infinite loop, causing heavy CPU usage.
> 
> Thanks
> 

Sorry for the late answer.
Without sending us the core, if you could at least give us the gdb output
of "thread apply all bt" so that we know where each thread is spinning, that
may be very useful, and shouldn't leak any confidential information.

Thanks !

Olivier

> [image: Screen Shot 2018-10-12 at 8.13.02 PM.png]
> 
> On Fri, Oct 12, 2018 at 12:01 PM Soji Antony  wrote:
> 
> > Hi Oliver,
> >
> > Thanks for the suggestion. We have upgraded haproxy to 1.8.14 but seeing
> > the same CPU issue again.
> > I have found that the segmentation fault which we were seeing earlier is
> > not related to the CPU spike as it is happening at different time. Recently
> > we had the same issue with one of our haproxy servers and found the
> > following in strace o/p:
> >
> > # haproxy -vv
> >
> > HA-Proxy version 1.8.14-1ppa1~trusty 2018/09/23
> > Copyright 2000-2018 Willy Tarreau 
> >
> > Build options :
> >   TARGET  = linux2628
> >   CPU = generic
> >   CC  = gcc
> >   CFLAGS  = -g -O2 -fPIE -fstack-protector --param=ssp-buffer-size=4
> > -Wformat -Werror=format-security -D_FORTIFY_SOURCE=2
> >   OPTIONS = USE_GETADDRINFO=1 USE_ZLIB=1 USE_REGPARM=1 USE_OPENSSL=1
> > USE_LUA=1 USE_PCRE=1 USE_PCRE_JIT=1 USE_NS=1
> >
> > Default settings :
> >   maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200
> >
> > Built with OpenSSL version : OpenSSL 1.0.1f 6 Jan 2014
> > Running on OpenSSL version : OpenSSL 1.0.1f 6 Jan 2014
> > OpenSSL library supports TLS extensions : yes
> > OpenSSL library supports SNI : yes
> > Built with Lua version : Lua 5.3.1
> > Built with transparent proxy support using: IP_TRANSPARENT
> > IPV6_TRANSPARENT IP_FREEBIND
> > Encrypted password support via crypt(3): yes
> > Built with multi-threading support.
> > Built with PCRE version : 8.31 2012-07-06
> > Running on PCRE version : 8.31 2012-07-06
> > PCRE library supports JIT : no (libpcre build without JIT?)
> > Built with zlib version : 1.2.8
> > Running on zlib version : 1.2.8
> > Compression algorithms supported : identity("identity"),
> > deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
> > Built with network namespace support.
> >
> > Available polling systems :
> >   epoll : pref=300,  test result OK
> >poll : pref=200,  test result OK
> >  select : pref=150,  test result OK
> > Total: 3 (3 usable), will use epoll.
> >
> > Available filters :
> > [SPOE] spoe
> > [COMP] compression
> > [TRACE] trace
> >
> > Strace O/P:
> >
> > [pid 114266] <... sched_yield resumed> ) = 0
> > [pid 114265] sched_yield( 
> > [pid 114267] <... sched_yield resumed> ) = 0
> > [pid 114266] sched_yield( 
> > [pid 114265] <... sched_yield resumed> ) = 0
> > [pid 114267] sched_yield( 
> > [pid 114266] <... sched_yield resumed> ) = 0
> > [pid 114265] sched_yield( 
> > [pid 114267] <... sched_yield resumed> ) = 0
> > [pid 114266] sched_yield( 
> > [pid 114265] <... sched_yield resumed> ) = 0
> > [pid 114267] sched_yield( 
> > [pid 114266] <... sched_yield resumed> ) = 0
> > [pid 114265] sched_yield( 
> > [pid 114267] <... sched_yield resumed> ) = 0
> > [pid 114266] sched_yield( 
> > [pid 114265] <... sched_yield resumed> ) = 0
> > [pid 114267] sched_yield( 
> > [pid 114266] <... sched_yield resumed> ) = 0
> > [pid 114265] sched_yield( 
> > [pid 114267] <... sched_yield resumed> ) = 0
> > [pid 114266] sched_yield( 
> > [pid 114265] <... sched_yield resumed> ) = 0
> > [pid 114267] sched_yield( 
> > [pid 114266] <... sched_yield resumed> ) = 0
> > [pid 114265] sched_yield( 
> > [pid 114267] <... sched_yield resumed> ) = 0
> > [pid 114266] sched_yield( 
> > [pid 114265] <... sched_yield resumed> ) = 0
> > [pid 114267] sched_yield( 
> > [pid 114266] <... sched_yield resumed> ) = 0
> > [pid 114265] sched_yield( 
> > [pid 114267] <... sched_yield resumed> ) = 0
> > [pid 114266] sched_yield( 
> > [pid 114265] <... sched_yield resumed> ) = 0
> > [pid 114267] sched_yield( 
> > [pid 114266] <... sched_yield resumed> ) = 0
> > [pid 114267] <... sched_yield resumed> ) = 0
> > [pid 114266] sched_yield( 
> > [pid 114265] sched_yield( 
> > [pid 114267] sched_yield( 
> > [pid 114266] <... sched_yield resumed> ) = 0
> > [pid 114265] <... sched_yield resumed> ) = 

Re: High CPU Usage followed by segfault error

2018-10-12 Thread Soji Antony
Hi Olivier,

Thanks for the suggestion. We have upgraded haproxy to 1.8.14 but seeing
the same CPU issue again.
I have found that the segmentation fault which we were seeing earlier is
not related to the CPU spike, as it is happening at a different time. Recently
we had the same issue with one of our haproxy servers and found the
following in the strace output:

# haproxy -vv

HA-Proxy version 1.8.14-1ppa1~trusty 2018/09/23
Copyright 2000-2018 Willy Tarreau 

Build options :
  TARGET  = linux2628
  CPU = generic
  CC  = gcc
  CFLAGS  = -g -O2 -fPIE -fstack-protector --param=ssp-buffer-size=4
-Wformat -Werror=format-security -D_FORTIFY_SOURCE=2
  OPTIONS = USE_GETADDRINFO=1 USE_ZLIB=1 USE_REGPARM=1 USE_OPENSSL=1
USE_LUA=1 USE_PCRE=1 USE_PCRE_JIT=1 USE_NS=1

Default settings :
  maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with OpenSSL version : OpenSSL 1.0.1f 6 Jan 2014
Running on OpenSSL version : OpenSSL 1.0.1f 6 Jan 2014
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
Built with Lua version : Lua 5.3.1
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT
IP_FREEBIND
Encrypted password support via crypt(3): yes
Built with multi-threading support.
Built with PCRE version : 8.31 2012-07-06
Running on PCRE version : 8.31 2012-07-06
PCRE library supports JIT : no (libpcre build without JIT?)
Built with zlib version : 1.2.8
Running on zlib version : 1.2.8
Compression algorithms supported : identity("identity"),
deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with network namespace support.

Available polling systems :
  epoll : pref=300,  test result OK
   poll : pref=200,  test result OK
 select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.

Available filters :
[SPOE] spoe
[COMP] compression
[TRACE] trace

Strace O/P:

[pid 114266] <... sched_yield resumed> ) = 0
[pid 114265] sched_yield( 
[pid 114267] <... sched_yield resumed> ) = 0
[pid 114266] sched_yield( 
[pid 114265] <... sched_yield resumed> ) = 0
[pid 114267] sched_yield( 
[pid 114266] <... sched_yield resumed> ) = 0
[pid 114265] sched_yield( 
[pid 114267] <... sched_yield resumed> ) = 0
[pid 114266] sched_yield( 
[pid 114265] <... sched_yield resumed> ) = 0
[pid 114267] sched_yield( 
[pid 114266] <... sched_yield resumed> ) = 0
[pid 114265] sched_yield( 
[pid 114267] <... sched_yield resumed> ) = 0
[pid 114266] sched_yield( 
[pid 114265] <... sched_yield resumed> ) = 0
[pid 114267] sched_yield( 
[pid 114266] <... sched_yield resumed> ) = 0
[pid 114265] sched_yield( 
[pid 114267] <... sched_yield resumed> ) = 0
[pid 114266] sched_yield( 
[pid 114265] <... sched_yield resumed> ) = 0
[pid 114267] sched_yield( 
[pid 114266] <... sched_yield resumed> ) = 0
[pid 114265] sched_yield( 
[pid 114267] <... sched_yield resumed> ) = 0
[pid 114266] sched_yield( 
[pid 114265] <... sched_yield resumed> ) = 0
[pid 114267] sched_yield( 
[pid 114266] <... sched_yield resumed> ) = 0
[pid 114265] sched_yield( 
[pid 114267] <... sched_yield resumed> ) = 0
[pid 114266] sched_yield( 
[pid 114265] <... sched_yield resumed> ) = 0
[pid 114267] sched_yield( 
[pid 114266] <... sched_yield resumed> ) = 0
[pid 114267] <... sched_yield resumed> ) = 0
[pid 114266] sched_yield( 
[pid 114265] sched_yield( 
[pid 114267] sched_yield( 
[pid 114266] <... sched_yield resumed> ) = 0
[pid 114265] <... sched_yield resumed> ) = 0
[pid 114267] <... sched_yield resumed> ) = 0
[pid 114266] sched_yield( 
[pid 114265] sched_yield( 
[pid 114267] sched_yield( 
[pid 114266] <... sched_yield resumed> ) = 0
[pid 114265] <... sched_yield resumed> ) = 0
[pid 114267] <... sched_yield resumed> ) = 0
[pid 114266] sched_yield( 
[pid 114265] sched_yield( 
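A tight loop of sched_yield() calls like the trace above is the classic signature of threads spinning on a contended userspace lock. A minimal sketch of that pattern (illustration only, not haproxy's actual lock implementation):

```c
#include <sched.h>
#include <stdatomic.h>

/* A yielding spinlock of the kind that produces a sched_yield() storm
 * under contention: each waiter repeatedly tries to grab the flag and
 * yields the CPU when it fails. */
typedef struct { atomic_flag locked; } spin_t;

static spin_t demo_lock = { ATOMIC_FLAG_INIT };

void spin_lock(spin_t *s)
{
    /* test-and-set returns the previous value: nonzero means the lock
     * was already held, so yield and retry */
    while (atomic_flag_test_and_set_explicit(&s->locked, memory_order_acquire))
        sched_yield();
}

void spin_unlock(spin_t *s)
{
    atomic_flag_clear_explicit(&s->locked, memory_order_release);
}
```

Under SCHED_RR with all spinners at the same (maximal) priority, sched_yield() merely rotates execution among the waiting threads, so the contention shows up as sustained 100% CPU and, eventually, the kernel's "RT throttling activated" message seen below.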

kernel.log

Oct 10 19:13:04 int16 kernel: [192997.62] sched: RT throttling activated
Oct 10 19:16:28 int16 kernel: [193201.140115] INFO: task :1213 blocked for more than 120 seconds.
Oct 10 19:16:28 int16 kernel: [193201.144250]   Tainted: G   OE
Oct 10 19:16:28 int16 kernel: [193201.147927] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 10 19:16:28 int16 kernel: [193201.152389]  880768c878a8 880768c87968 880766ae3700 880768c88000
Oct 10 19:16:28 int16 kernel: [193201.152392]   7fff 88078ffd50f0 817f1700
Oct 10 19:16:28 int16 kernel: [193201.152394]  880768c878c0 817f0fb5 88076f917200 880768c87970
Oct 10 19:16:28 int16 kernel: [193201.152396] Call Trace:
Oct 10 19:16:28 int16 kernel: [193201.152402]  [] ? bit_wait+0x50/0x50
Oct 10 19:16:28 int16 kernel: [193201.152404]  [] schedule+0x35/0x80
Oct 10 19:16:28 int16 kernel: [193201.152418]  [] schedule_timeout+0x23b/0x2d0
Oct 10 19:16:28 int16 kernel: [193201.152430]  [] ? xen_clocksource_read+0x15/0x20
Oct 10 19:16:28 int16 kernel: [193201.152438]  [] ? sched_clock+0x9/0x10
Oct 10 19:16:28 int16 kernel: [193201.152441]  [] ?

Re: High CPU Usage followed by segfault error

2018-10-02 Thread Olivier Houchard
Hi,

On Tue, Oct 02, 2018 at 08:26:12PM +0530, Soji Antony wrote:
> Hello,
> 
> We are currently using haproxy 1.8.3 with single process multithreaded
> configuration.
> We have 1 process and 10 threads each mapped to a separate core [0-9]. We
> are running our haproxy instances on a c4.4xlarge AWS EC2 instance. The
> only other CPU-intensive process running on this server is a log shipper,
> which is explicitly pinned to CPU cores 13-16 with the taskset command.
> We have also given the haproxy processes 'SCHED_RR' priority 99.
> 
> OS: Ubuntu 14
> Kernel: 4.4.0-134-generic
> 
> The issue we are seeing with HAProxy is that CPU usage suddenly spikes to
> 100% on the cores haproxy is using, causing latency spikes and high load
> on the server. We see the following error messages in the system /
> kernel logs when this happens.
> 
> haproxy[92558]: segfault at 8 ip 55f04b1f5da2 sp 7ffdab2bdd40 error
> 6 in haproxy[55f04b101000+17]
> 
> Sep 29 12:21:02 marathonlb-int21 kernel: [2223350.996059] sched: RT
> throttling activated
> 
> We are using marathonlb for auto discovery, and reloads are quite frequent
> on this server. The last time this issue happened, haproxy was using 750%
> CPU and went into D state. The old process was also consuming CPU.
> 
> hard-stop-after was not set in our hap configuration and we were seeing
> multiple old PIDs running on the server. After the last CPU outage we set
> 'hard-stop-after' to 10s, and now we no longer see multiple hap instances
> running after a reload. I would really appreciate it if someone could
> explain why the CPU usage spikes with the above segfault error and what
> this error exactly means.
> 
> FYI: There was no traffic spike on this hap instance when the issue
> happened. We have even seen the same issue on a non-prod hap that was
> receiving no traffic: the system went down due to CPU usage and we found
> the same segfault error in the logs.
> 

A good first step would probably be to upgrade to the latest version if
possible. 1.8.3 is quite old, and a bunch of bugs have been fixed since
then, especially when using multithreading.

Regards,

Olivier



High CPU Usage followed by segfault error

2018-10-02 Thread Soji Antony
Hello,

We are currently using haproxy 1.8.3 with single process multithreaded
configuration.
We have 1 process and 10 threads each mapped to a separate core [0-9]. We
are running our haproxy instances on a c4.4xlarge AWS EC2 instance. The
only other CPU-intensive process running on this server is a log shipper,
which is explicitly pinned to CPU cores 13-16 with the taskset command.
We have also given the haproxy processes 'SCHED_RR' priority 99.

OS: Ubuntu 14
Kernel: 4.4.0-134-generic

The issue we are seeing with HAProxy is that CPU usage suddenly spikes to
100% on the cores haproxy is using, causing latency spikes and high load
on the server. We see the following error messages in the system /
kernel logs when this happens.

haproxy[92558]: segfault at 8 ip 55f04b1f5da2 sp 7ffdab2bdd40 error
6 in haproxy[55f04b101000+17]

Sep 29 12:21:02 marathonlb-int21 kernel: [2223350.996059] sched: RT
throttling activated

We are using marathonlb for auto discovery, and reloads are quite frequent
on this server. The last time this issue happened, haproxy was using 750%
CPU and went into D state. The old process was also consuming CPU.

hard-stop-after was not set in our hap configuration and we were seeing
multiple old PIDs running on the server. After the last CPU outage we set
'hard-stop-after' to 10s, and now we no longer see multiple hap instances
running after a reload. I would really appreciate it if someone could
explain why the CPU usage spikes with the above segfault error and what
this error exactly means.
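For reference, hard-stop-after is set in the global section of the configuration; a minimal sketch of the setting described above (the 10s value is the one we chose, not a recommendation):

```
# /etc/haproxy/haproxy.cfg (fragment)
global
    # Kill the old process at most 10s after a soft reload (-sf),
    # instead of letting finishing workers linger indefinitely.
    hard-stop-after 10s
```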

FYI: There was no traffic spike on this hap instance when the issue
happened. We have even seen the same issue on a non-prod hap that was
receiving no traffic: the system went down due to CPU usage and we found
the same segfault error in the logs.

Thanks