Hi,

On Sat, Oct 13, 2018 at 01:34:53PM +0530, Soji Antony wrote:
> I would really appreciate it if someone could help with this issue. It is
> happening quite frequently on our servers. I took a core dump the last time it
> happened, using the 'gcore' command, but I am unable to share it as it might
> contain SSL-related information. Is there any way I can remove confidential
> information from the core dump before sharing it? Unfortunately we are not
> able to reproduce this issue in a test environment. Also, CPU usage was at
> 100% on all the cores haproxy was using for around 4 hours, even though we had
> taken the server out of DNS rotation and it was not serving any traffic. To me
> it looks like something internal to haproxy is causing an infinite loop,
> resulting in heavy CPU usage.
> 
> Thanks
> 

Sorry for the late answer.
Even without the core, if you could at least give us the gdb output
of "thread apply all bt" so that we know where each thread is spinning, that
would be very useful, and it shouldn't leak any confidential information.

Thanks !

Olivier

> 
> On Fri, Oct 12, 2018 at 12:01 PM Soji Antony <sojins...@gmail.com> wrote:
> 
> > Hi Olivier,
> >
> > Thanks for the suggestion. We have upgraded haproxy to 1.8.14 but are
> > seeing the same CPU issue again.
> > I have found that the segmentation fault we were seeing earlier is not
> > related to the CPU spike, as it happens at a different time. Recently we
> > had the same issue with one of our haproxy servers and found the
> > following in the strace output:
> >
> > # haproxy -vv
> >
> > HA-Proxy version 1.8.14-1ppa1~trusty 2018/09/23
> > Copyright 2000-2018 Willy Tarreau <wi...@haproxy.org>
> >
> > Build options :
> >   TARGET  = linux2628
> >   CPU     = generic
> >   CC      = gcc
> >   CFLAGS  = -g -O2 -fPIE -fstack-protector --param=ssp-buffer-size=4
> > -Wformat -Werror=format-security -D_FORTIFY_SOURCE=2
> >   OPTIONS = USE_GETADDRINFO=1 USE_ZLIB=1 USE_REGPARM=1 USE_OPENSSL=1
> > USE_LUA=1 USE_PCRE=1 USE_PCRE_JIT=1 USE_NS=1
> >
> > Default settings :
> >   maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200
> >
> > Built with OpenSSL version : OpenSSL 1.0.1f 6 Jan 2014
> > Running on OpenSSL version : OpenSSL 1.0.1f 6 Jan 2014
> > OpenSSL library supports TLS extensions : yes
> > OpenSSL library supports SNI : yes
> > Built with Lua version : Lua 5.3.1
> > Built with transparent proxy support using: IP_TRANSPARENT
> > IPV6_TRANSPARENT IP_FREEBIND
> > Encrypted password support via crypt(3): yes
> > Built with multi-threading support.
> > Built with PCRE version : 8.31 2012-07-06
> > Running on PCRE version : 8.31 2012-07-06
> > PCRE library supports JIT : no (libpcre build without JIT?)
> > Built with zlib version : 1.2.8
> > Running on zlib version : 1.2.8
> > Compression algorithms supported : identity("identity"),
> > deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
> > Built with network namespace support.
> >
> > Available polling systems :
> >       epoll : pref=300,  test result OK
> >        poll : pref=200,  test result OK
> >      select : pref=150,  test result OK
> > Total: 3 (3 usable), will use epoll.
> >
> > Available filters :
> > [SPOE] spoe
> > [COMP] compression
> > [TRACE] trace
> >
> > strace output:
> >
> > [pid 114266] <... sched_yield resumed> ) = 0
> > [pid 114265] sched_yield( <unfinished ...>
> > [pid 114267] <... sched_yield resumed> ) = 0
> > [pid 114266] sched_yield( <unfinished ...>
> > [pid 114265] <... sched_yield resumed> ) = 0
> > [pid 114267] sched_yield( <unfinished ...>
> > [pid 114266] <... sched_yield resumed> ) = 0
> > [pid 114265] sched_yield( <unfinished ...>
> > [pid 114267] <... sched_yield resumed> ) = 0
> > [pid 114266] sched_yield( <unfinished ...>
> > [pid 114265] <... sched_yield resumed> ) = 0
> > [pid 114267] sched_yield( <unfinished ...>
> > [pid 114266] <... sched_yield resumed> ) = 0
> > [pid 114265] sched_yield( <unfinished ...>
> > [pid 114267] <... sched_yield resumed> ) = 0
> > [pid 114266] sched_yield( <unfinished ...>
> > [pid 114265] <... sched_yield resumed> ) = 0
> > [pid 114267] sched_yield( <unfinished ...>
> > [pid 114266] <... sched_yield resumed> ) = 0
> > [pid 114265] sched_yield( <unfinished ...>
> > [pid 114267] <... sched_yield resumed> ) = 0
> > [pid 114266] sched_yield( <unfinished ...>
> > [pid 114265] <... sched_yield resumed> ) = 0
> > [pid 114267] sched_yield( <unfinished ...>
> > [pid 114266] <... sched_yield resumed> ) = 0
> > [pid 114265] sched_yield( <unfinished ...>
> > [pid 114267] <... sched_yield resumed> ) = 0
> > [pid 114266] sched_yield( <unfinished ...>
> > [pid 114265] <... sched_yield resumed> ) = 0
> > [pid 114267] sched_yield( <unfinished ...>
> > [pid 114266] <... sched_yield resumed> ) = 0
> > [pid 114265] sched_yield( <unfinished ...>
> > [pid 114267] <... sched_yield resumed> ) = 0
> > [pid 114266] sched_yield( <unfinished ...>
> > [pid 114265] <... sched_yield resumed> ) = 0
> > [pid 114267] sched_yield( <unfinished ...>
> > [pid 114266] <... sched_yield resumed> ) = 0
> > [pid 114267] <... sched_yield resumed> ) = 0
> > [pid 114266] sched_yield( <unfinished ...>
> > [pid 114265] sched_yield( <unfinished ...>
> > [pid 114267] sched_yield( <unfinished ...>
> > [pid 114266] <... sched_yield resumed> ) = 0
> > [pid 114265] <... sched_yield resumed> ) = 0
> > [pid 114267] <... sched_yield resumed> ) = 0
> > [pid 114266] sched_yield( <unfinished ...>
> > [pid 114265] sched_yield( <unfinished ...>
> > [pid 114267] sched_yield( <unfinished ...>
> > [pid 114266] <... sched_yield resumed> ) = 0
> > [pid 114265] <... sched_yield resumed> ) = 0
> > [pid 114267] <... sched_yield resumed> ) = 0
> > [pid 114266] sched_yield( <unfinished ...>
> > [pid 114265] sched_yield( <unfinished ...>
> >
> > kernel.log
> >
> > Oct 10 19:13:04 int16 kernel: [192997.000062] sched: RT throttling
> > activated
> > Oct 10 19:16:28 int16 kernel: [193201.140115] INFO: task <logshipper>:1213
> > blocked for more than 120 seconds.
> > Oct 10 19:16:28 int16 kernel: [193201.144250]       Tainted: G
> > OE   <kernel-version>
> > Oct 10 19:16:28 int16 kernel: [193201.147927] "echo 0 >
> > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > Oct 10 19:16:28 int16 kernel: [193201.152389]  ffff880768c878a8
> > ffff880768c87968 ffff880766ae3700 ffff880768c88000
> > Oct 10 19:16:28 int16 kernel: [193201.152392]  0000000000000000
> > 7fffffffffffffff ffff88078ffd50f0 ffffffff817f1700
> > Oct 10 19:16:28 int16 kernel: [193201.152394]  ffff880768c878c0
> > ffffffff817f0fb5 ffff88076f917200 ffff880768c87970
> > Oct 10 19:16:28 int16 kernel: [193201.152396] Call Trace:
> > Oct 10 19:16:28 int16 kernel: [193201.152402]  [<ffffffff817f1700>] ?
> > bit_wait+0x50/0x50
> > Oct 10 19:16:28 int16 kernel: [193201.152404]  [<ffffffff817f0fb5>]
> > schedule+0x35/0x80
> > Oct 10 19:16:28 int16 kernel: [193201.152418]  [<ffffffff817f39ab>]
> > schedule_timeout+0x23b/0x2d0
> > Oct 10 19:16:28 int16 kernel: [193201.152430]  [<ffffffff81022d35>] ?
> > xen_clocksource_read+0x15/0x20
> > Oct 10 19:16:28 int16 kernel: [193201.152438]  [<ffffffff81036ab9>] ?
> > sched_clock+0x9/0x10
> > Oct 10 19:16:28 int16 kernel: [193201.152441]  [<ffffffff813bf154>] ?
> > __blk_mq_alloc_request+0xe4/0x1f0
> > Oct 10 19:16:28 int16 kernel: [193201.152442]  [<ffffffff817f1700>] ?
> > bit_wait+0x50/0x50
> > Oct 10 19:16:28 int16 kernel: [193201.152445]  [<ffffffff817f06f6>]
> > io_schedule_timeout+0xa6/0x110
> > Oct 10 19:16:28 int16 kernel: [193201.152446]  [<ffffffff817f171b>]
> > bit_wait_io+0x1b/0x60
> > Oct 10 19:16:28 int16 kernel: [193201.152448]  [<ffffffff817f1362>]
> > __wait_on_bit+0x62/0x90
> > Oct 10 19:16:28 int16 kernel: [193201.152451]  [<ffffffff81187bc0>]
> > wait_on_page_bit+0xc0/0xd0
> > Oct 10 19:16:28 int16 kernel: [193201.152454]  [<ffffffff810c27f0>] ?
> > autoremove_wake_function+0x40/0x40
> > Oct 10 19:16:28 int16 kernel: [193201.152456]  [<ffffffff81197e66>]
> > truncate_inode_pages_range+0x366/0x6d0
> > Oct 10 19:16:28 int16 kernel: [193201.152459]  [<ffffffff81554398>] ?
> > blkif_queue_rq+0x508/0x690
> > Oct 10 19:16:28 int16 kernel: [193201.152461]  [<ffffffff817f2d82>] ?
> > down_write+0x12/0x40
> > Oct 10 19:16:28 int16 kernel: [193201.152465]  [<ffffffff811b6646>] ?
> > unmap_mapping_range+0x66/0x110
> > Oct 10 19:16:28 int16 kernel: [193201.152467]  [<ffffffff81198297>]
> > truncate_pagecache+0x47/0x60
> > Oct 10 19:16:28 int16 kernel: [193201.152469]  [<ffffffff811982e2>]
> > truncate_setsize+0x32/0x40
> > Oct 10 19:16:28 int16 kernel: [193201.152509]  [<ffffffffc014d5d8>]
> > xfs_setattr_size+0x168/0x3d0 [xfs]
> > Oct 10 19:16:28 int16 kernel: [193201.152522]  [<ffffffffc014d8df>]
> > xfs_vn_setattr+0x9f/0xb0 [xfs]
> > Oct 10 19:16:28 int16 kernel: [193201.152524]  [<ffffffff81221e80>]
> > notify_change+0x250/0x470
> > Oct 10 19:16:28 int16 kernel: [193201.152533]  [<ffffffff81202a86>]
> > do_truncate+0x66/0xa0
> > Oct 10 19:16:28 int16 kernel: [193201.152545]  [<ffffffff81212d06>]
> > path_openat+0x266/0x12e0
> > Oct 10 19:16:28 int16 kernel: [193201.152555]  [<ffffffffc02f80ba>] ?
> > record_event_consumer.part.5+0x2ea/0x9e0 [sysdigcloud_probe]
> > Oct 10 19:16:28 int16 kernel: [193201.152564]  [<ffffffff8118a6f5>] ?
> > generic_file_read_iter+0x5d5/0x670
> > Oct 10 19:16:28 int16 kernel: [193201.152566]  [<ffffffff812159ee>]
> > do_filp_open+0x7e/0xd0
> > Oct 10 19:16:28 int16 kernel: [193201.152569]  [<ffffffff81222f84>] ?
> > __alloc_fd+0xc4/0x180
> > Oct 10 19:16:28 int16 kernel: [193201.152571]  [<ffffffff81204b19>]
> > do_sys_open+0x129/0x270
> > Oct 10 19:16:28 int16 kernel: [193201.152574]  [<ffffffffc02f8aef>] ?
> > syscall_enter_probe+0xef/0x100 [sysdigcloud_probe]
> > Oct 10 19:16:28 int16 kernel: [193201.152576]  [<ffffffff81204c94>]
> > SyS_openat+0x14/0x20
> > Oct 10 19:16:28 int16 kernel: [193201.152578]  [<ffffffff817f4a12>]
> > tracesys_phase2+0x94/0x99
> >
> > FYI, the CPU scheduling policy set for haproxy is SCHED_RR, priority 99.
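For reference, the "sched: RT throttling activated" message above is the kernel
capping realtime-class tasks: by default, SCHED_RR/SCHED_FIFO tasks may run for
kernel.sched_rt_runtime_us microseconds out of every kernel.sched_rt_period_us
(950000 out of 1000000 by default, i.e. 95% of each second), and threads
spinning in sched_yield at priority 99 will hit that ceiling. A quick way to
check the policy and the budget (a sketch; the pidof lookup is an assumption):

```shell
# Show the scheduling policy and realtime priority of the haproxy process
chrt -p "$(pidof haproxy)"

# Show the kernel's realtime budget (runtime per period, in microseconds)
sysctl kernel.sched_rt_runtime_us kernel.sched_rt_period_us
```

Running a userspace proxy at realtime priority 99 can also starve kernel
threads on those cores, which would fit the hung-task warnings in the kernel
log.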
> >
> > Thanks
> >
> > On Tue, Oct 2, 2018 at 10:09 PM Olivier Houchard <ohouch...@haproxy.com>
> > wrote:
> >
> >> Hi,
> >>
> >> On Tue, Oct 02, 2018 at 08:26:12PM +0530, Soji Antony wrote:
> >> > Hello,
> >> >
> >> > We are currently using haproxy 1.8.3 in a single-process, multithreaded
> >> > configuration.
> >> > We have 1 process and 10 threads, each mapped to a separate core [0-9].
> >> > We are running our haproxy instances on a c4.4xlarge AWS EC2 instance.
> >> > The only other CPU-intensive process running on this server is a log
> >> > shipper, which is explicitly mapped to CPU cores 13-16 using the taskset
> >> > command. Also, we have given the haproxy processes 'SCHED_RR' priority
> >> > 99.
> >> >
> >> > OS: Ubuntu 14
> >> > Kernel: 4.4.0-134-generic
> >> >
> >> > The issue we are seeing with haproxy is that CPU usage suddenly spikes
> >> > to 100% on the cores haproxy is using, causing latency spikes and high
> >> > load on the server. We see the following error messages in the system /
> >> > kernel logs when this issue happens:
> >> >
> >> > haproxy[92558]: segfault at 8 ip 000055f04b1f5da2 sp 00007ffdab2bdd40
> >> > error 6 in haproxy[55f04b101000+170000]
> >> >
> >> > Sep 29 12:21:02 marathonlb-int21 kernel: [2223350.996059] sched: RT
> >> > throttling activated
> >> >
> >> > We are using marathon-lb for auto-discovery, and reloads are quite
> >> > frequent on this server. The last time this issue happened we saw
> >> > haproxy using 750% of CPU, and it went into the D state. The old process
> >> > was also consuming CPU.
> >> >
> >> > hard-stop-after was not set in our haproxy configuration, and we were
> >> > seeing multiple old PIDs running on the server. After the last CPU
> >> > outage we set 'hard-stop-after' to 10s, and now we no longer see
> >> > multiple haproxy instances running after a reload. I would really
> >> > appreciate it if someone could explain why the CPU usage spikes with the
> >> > above segfault error and what the error means exactly.
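For context, hard-stop-after is a global-section directive (available since
haproxy 1.8); a minimal fragment, assuming the 10s value mentioned above:

```
global
    hard-stop-after 10s
```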
> >> >
> >> > FYI: There was no traffic spike on this haproxy instance when the issue
> >> > happened. We have even seen the same issue on a non-prod haproxy that
> >> > was receiving no traffic; the system went down due to CPU usage, and we
> >> > found the same segfault error in the logs.
> >> >
> >>
> >> A good first step would probably be to upgrade to the latest version, if
> >> possible.
> >> 1.8.3 is quite old, and a bunch of bugs have been fixed since then,
> >> especially around multithreading.
> >>
> >> Regards,
> >>
> >> Olivier
> >>
> >


