Re: [ceph-users] radosgw pegging down 5 CPU cores when no data is being transferred
I've also encountered this issue on a cluster yesterday; one CPU got stuck in an infinite loop in get_obj_data::flush and it stopped serving requests. I've updated the tracker issue accordingly.

Paul

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Wed, Aug 21, 2019 at 3:55 PM Vladimir Brik wrote:
>
> Hello
>
> I am running a Ceph 14.2.1 cluster with 3 rados gateways. Periodically,
> the radosgw process on those machines starts consuming 100% of 5 CPU cores
> for days at a time, even though the machine is not being used for data
> transfers (nothing in radosgw logs, couple of KB/s of network).
>
> This situation can affect any number of our rados gateways, lasts from
> a few hours to a few days, and stops when the radosgw process is restarted
> or on its own.
>
> Does anybody have an idea what might be going on or how to debug it? I
> don't see anything obvious in the logs. Perf top is saying that CPU is
> consumed by the radosgw shared object in symbol get_obj_data::flush, which,
> if I interpret things correctly, is called from a symbol with a long
> name that contains the substring "boost9intrusive9list_impl"
>
> This is our configuration:
> rgw_frontends = civetweb num_threads=5000 port=443s
> ssl_certificate=/etc/ceph/rgw.crt
> error_log_file=/var/log/ceph/civetweb.error.log
>
> (the error log file doesn't exist)
>
> Thanks,
>
> Vlad

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] radosgw pegging down 5 CPU cores when no data is being transferred
I created a ticket: https://tracker.ceph.com/issues/41511

Note that I think I was mistaken when I said that sometimes the problem goes away on its own. I've looked back through our monitoring and it looks like when the problem did go away, it was because either the machine was rebooted or the radosgw service was restarted.

Vlad

On 8/23/19 10:17 AM, Eric Ivancich wrote:
Good morning, Vladimir,

Please create a tracker for this (https://tracker.ceph.com/projects/rgw/issues/new) and include the link to it in an email reply. And if you can include any more potentially relevant details, please do so. I’ll add my initial analysis to it.

But the threads do seem to be stuck, at least for a while, in get_obj_data::flush despite a lack of traffic. And sometimes it self-resolves, so it’s not a true “infinite loop”.

Thank you,
Eric

On Aug 22, 2019, at 9:12 PM, Eric Ivancich <mailto:ivanc...@redhat.com> wrote:

Thank you for providing the profiling data, Vladimir. There are 5078 threads and most of them are waiting. Here is a list of the deepest call of each thread with duplicates removed.
+ 100.00% epoll_wait
+ 100.00% get_obj_data::flush(rgw::OwningList&&)
+ 100.00% poll
+ 100.00% poll
+ 100.00% poll
+ 100.00% pthread_cond_timedwait@@GLIBC_2.3.2
+ 100.00% pthread_cond_timedwait@@GLIBC_2.3.2
+ 100.00% pthread_cond_wait@@GLIBC_2.3.2
+ 100.00% pthread_cond_wait@@GLIBC_2.3.2
+ 100.00% read
+ 100.00% _ZN5boost9intrusive9list_implINS0_8bhtraitsIN3rgw14AioResultEntryENS0_16list_node_traitsIPvEELNS0_14link_mode_typeE1ENS0_7dft_tagELj1EEEmLb1EvE4sortIZN12get_obj_data5flushEONS3_10OwningListIS4_JUlRKT_RKT0_E_EEvSH_

The only interesting ones are the second and last:

* get_obj_data::flush(rgw::OwningList&&)
* _ZN5boost9intrusive9list_implINS0_8bhtraitsIN3rgw14AioResultEntryENS0_16list_node_traitsIPvEELNS0_14link_mode_typeE1ENS0_7dft_tagELj1EEEmLb1EvE4sortIZN12get_obj_data5flushEONS3_10OwningListIS4_JUlRKT_RKT0_E_EEvSH_

They are essentially part of the same call stack that results from processing a GetObj request, and five threads are in this call stack (the only difference is whether or not they include the call into the boost intrusive list).
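As an aside on why Vlad could spot the readable substring "boost9intrusive9list_impl" inside that long symbol: the Itanium C++ ABI encodes each component of a nested name as a decimal length followed by the identifier, so `5boost9intrusive9list_impl` reads as `boost::intrusive::list_impl`. The standard way to decode the whole thing is `c++filt`. Purely as an illustration of the encoding (this is a crude heuristic, not a real demangler; it mis-fires on literals and substitution tokens), a short Python sketch can pull the length-prefixed identifiers out of the symbol above:

```python
import re

# The mangled boost::intrusive::list_impl::sort symbol from the profile
SYM = ("_ZN5boost9intrusive9list_implINS0_8bhtraitsIN3rgw14AioResultEntryE"
       "NS0_16list_node_traitsIPvEELNS0_14link_mode_typeE1ENS0_7dft_tagELj1EEE"
       "mLb1EvE4sortIZN12get_obj_data5flushEONS3_10OwningListIS4_JUlRKT_RKT0_E_EEvSH_")

def mangled_identifiers(sym):
    """Crude heuristic: collect every <length><identifier> source-name token.

    A real Itanium demangler (what c++filt implements) also has to handle
    substitutions (S0_), literal values (Lj1E), lambda encodings, etc.;
    those surface here as occasional bogus tokens, which is acceptable
    for a quick eyeball of a hot symbol in perf output.
    """
    out, i = [], 0
    while i < len(sym):
        m = re.match(r"[1-9]\d*", sym[i:])
        if m:
            n = int(m.group())
            start = i + len(m.group())
            name = sym[start:start + n]
            if len(name) == n and re.fullmatch(r"[A-Za-z_]\w*", name):
                out.append(name)
                i = start + n
                continue
        i += 1
    return out

print(mangled_identifiers(SYM))
```

Running this surfaces `boost`, `intrusive`, `list_impl`, `rgw`, `AioResultEntry`, `sort`, `get_obj_data`, and `flush`, which is exactly the information used in the analysis above: boost's intrusive list sort called from a lambda inside get_obj_data::flush.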
Here’s the full call stack of those threads:

+ 100.00% clone
  + 100.00% start_thread
    + 100.00% worker_thread
      + 100.00% process_new_connection
        + 100.00% handle_request
          + 100.00% RGWCivetWebFrontend::process(mg_connection*)
            + 100.00% process_request(RGWRados*, RGWREST*, RGWRequest*, std::string const&, rgw::auth::StrategyRegistry const&, RGWRestfulIO*, OpsLogSocket*, optional_yield, rgw::dmclock::Scheduler*, int*)
              + 100.00% rgw_process_authenticated(RGWHandler_REST*, RGWOp*&, RGWRequest*, req_state*, bool)
                + 100.00% RGWGetObj::execute()
                  + 100.00% RGWRados::Object::Read::iterate(long, long, RGWGetDataCB*)
                    + 100.00% RGWRados::iterate_obj(RGWObjectCtx&, RGWBucketInfo const&, rgw_obj const&, long, long, unsigned long, int (*)(rgw_raw_obj const&, long, long, long, bool, RGWObjState*, void*), void*)
                      + 100.00% _get_obj_iterate_cb(rgw_raw_obj const&, long, long, long, bool, RGWObjState*, void*)
                        + 100.00% RGWRados::get_obj_iterate_cb(rgw_raw_obj const&, long, long, long, bool, RGWObjState*, void*)
                          + 100.00% get_obj_data::flush(rgw::OwningList&&)
                            + 100.00% _ZN5boost9intrusive9list_implINS0_8bhtraitsIN3rgw14AioResultEntryENS0_16list_node_traitsIPvEELNS0_14link_mode_typeE1ENS0_7dft_tagELj1EEEmLb1EvE4sortIZN12get_obj_data5flushEONS3_10OwningListIS4_JUlRKT_RKT0_E_EEvSH_

So this isn’t background processing but request processing. I’m not clear why these requests are consuming so much CPU for so long.

From your initial message:

> I am running a Ceph 14.2.1 cluster with 3 rados gateways. Periodically, radosgw process on those machines starts consuming 100% of 5 CPU cores for days at a time, even though the machine is not being used for data transfers (nothing in radosgw logs, couple of KB/s of network).
>
> This situation can affect any number of our rados gateways, lasts from few hours to few days and stops if radosgw process is restarted or on its own.

I’m going to check with others who’re more familiar with this code path.
Begin forwarded message:

From: Vladimir Brik <mailto:vladimir.b...@icecube.wisc.edu>
Subject: Re: [ceph-users] radosgw pegging down 5 CPU cores when no data is being transferred
Date: August 21, 2019 at 4:47:01 PM EDT
To: "J. Eric Ivancich" <mailto:ivanc...@redhat.com>, Mark Nelson <mailto:mnel...@redhat.com>, ceph-users@lists.
Re: [ceph-users] radosgw pegging down 5 CPU cores when no data is being transferred
Good morning, Vladimir,

Please create a tracker for this (https://tracker.ceph.com/projects/rgw/issues/new) and include the link to it in an email reply. And if you can include any more potentially relevant details, please do so. I’ll add my initial analysis to it.

But the threads do seem to be stuck, at least for a while, in get_obj_data::flush despite a lack of traffic. And sometimes it self-resolves, so it’s not a true “infinite loop”.

Thank you,
Eric

> On Aug 22, 2019, at 9:12 PM, Eric Ivancich wrote:
>
> Thank you for providing the profiling data, Vladimir. There are 5078 threads
> and most of them are waiting. Here is a list of the deepest call of each
> thread with duplicates removed.
>
> + 100.00% epoll_wait
> + 100.00% get_obj_data::flush(rgw::OwningList&&)
> + 100.00% poll
> + 100.00% poll
> + 100.00% poll
> + 100.00% pthread_cond_timedwait@@GLIBC_2.3.2
> + 100.00% pthread_cond_timedwait@@GLIBC_2.3.2
> + 100.00% pthread_cond_wait@@GLIBC_2.3.2
> + 100.00% pthread_cond_wait@@GLIBC_2.3.2
> + 100.00% read
> + 100.00% _ZN5boost9intrusive9list_implINS0_8bhtraitsIN3rgw14AioResultEntryENS0_16list_node_traitsIPvEELNS0_14link_mode_typeE1ENS0_7dft_tagELj1EEEmLb1EvE4sortIZN12get_obj_data5flushEONS3_10OwningListIS4_JUlRKT_RKT0_E_EEvSH_
>
> The only interesting ones are the second and last:
>
> * get_obj_data::flush(rgw::OwningList&&)
> * _ZN5boost9intrusive9list_implINS0_8bhtraitsIN3rgw14AioResultEntryENS0_16list_node_traitsIPvEELNS0_14link_mode_typeE1ENS0_7dft_tagELj1EEEmLb1EvE4sortIZN12get_obj_data5flushEONS3_10OwningListIS4_JUlRKT_RKT0_E_EEvSH_
>
> They are essentially part of the same call stack that results from processing
> a GetObj request, and five threads are in this call stack (the only
> difference is whether or not they include the call into the boost intrusive list).
> Here’s the full call stack of those threads:
>
> + 100.00% clone
>   + 100.00% start_thread
>     + 100.00% worker_thread
>       + 100.00% process_new_connection
>         + 100.00% handle_request
>           + 100.00% RGWCivetWebFrontend::process(mg_connection*)
>             + 100.00% process_request(RGWRados*, RGWREST*, RGWRequest*, std::string const&, rgw::auth::StrategyRegistry const&, RGWRestfulIO*, OpsLogSocket*, optional_yield, rgw::dmclock::Scheduler*, int*)
>               + 100.00% rgw_process_authenticated(RGWHandler_REST*, RGWOp*&, RGWRequest*, req_state*, bool)
>                 + 100.00% RGWGetObj::execute()
>                   + 100.00% RGWRados::Object::Read::iterate(long, long, RGWGetDataCB*)
>                     + 100.00% RGWRados::iterate_obj(RGWObjectCtx&, RGWBucketInfo const&, rgw_obj const&, long, long, unsigned long, int (*)(rgw_raw_obj const&, long, long, long, bool, RGWObjState*, void*), void*)
>                       + 100.00% _get_obj_iterate_cb(rgw_raw_obj const&, long, long, long, bool, RGWObjState*, void*)
>                         + 100.00% RGWRados::get_obj_iterate_cb(rgw_raw_obj const&, long, long, long, bool, RGWObjState*, void*)
>                           + 100.00% get_obj_data::flush(rgw::OwningList&&)
>                             + 100.00% _ZN5boost9intrusive9list_implINS0_8bhtraitsIN3rgw14AioResultEntryENS0_16list_node_traitsIPvEELNS0_14link_mode_typeE1ENS0_7dft_tagELj1EEEmLb1EvE4sortIZN12get_obj_data5flushEONS3_10OwningListIS4_JUlRKT_RKT0_E_EEvSH_
>
> So this isn’t background processing but request processing. I’m not clear why
> these requests are consuming so much CPU for so long.
>
> From your initial message:
>> I am running a Ceph 14.2.1 cluster with 3 rados gateways. Periodically,
>> radosgw process on those machines starts consuming 100% of 5 CPU cores for
>> days at a time, even though the machine is not being used for data transfers
>> (nothing in radosgw logs, couple of KB/s of network).
>>
>> This situation can affect any number of our rados gateways, lasts from few
>> hours to few days and stops if radosgw process is restarted or on its own.
>
> I’m going to check with others who’re more familiar with this code path.
>
>> Begin forwarded message:
>>
>> From: Vladimir Brik <mailto:vladimir.b...@icecube.wisc.edu>
>> Subject: Re: [ceph-users] radosgw pegging down 5 CPU cores when no data is
>> being transferred
>> Date: August 21, 2019 at 4:47:01 PM EDT
>> To: "J. Eric Ivancich" <mailto:ivanc...@redhat.com>,
>> Mark Nelson <mailto:mn
Re: [ceph-users] radosgw pegging down 5 CPU cores when no data is being transferred
> Are you running multisite?
No

> Do you have dynamic bucket resharding turned on?
Yes. "radosgw-admin reshard list" prints "[]"

> Are you using lifecycle?
I am not sure. How can I check? "radosgw-admin lc list" says "[]"

> And just to be clear -- sometimes all 3 of your rados gateways are
> simultaneously in this state?
Multiple, but I have not seen all 3 in this state simultaneously. Currently one gateway has 1 thread using 100% of a CPU core, and another has 5 threads each using 100% of a core.

Here are the fruits of my attempts to capture the call graph using perf and gdbpmp:
https://icecube.wisc.edu/~vbrik/perf.data
https://icecube.wisc.edu/~vbrik/gdbpmp.data

These are the commands that I ran and their outputs (note that I couldn't get perf not to generate the warning):

rgw-3 gdbpmp # ./gdbpmp.py -n 100 -p 73688 -o gdbpmp.data
Attaching to process 73688...Done.
Gathering Samples
Profiling complete with 100 samples.

rgw-3 ~ # perf record --call-graph fp -p 73688 -- sleep 10
[ perf record: Woken up 54 times to write data ]
Warning:
Processed 574207 events and lost 4 chunks! Check IO/CPU overload!
[ perf record: Captured and wrote 58.866 MB perf.data (233750 samples) ]

Vlad

On 8/21/19 11:16 AM, J. Eric Ivancich wrote:
> On 8/21/19 10:22 AM, Mark Nelson wrote:
>> Hi Vladimir,
>>
>> On 8/21/19 8:54 AM, Vladimir Brik wrote:
>>> Hello
>>> [much elided]
>> You might want to try grabbing a callgraph from perf instead of just
>> running perf top, or using my wallclock profiler, to see if you can drill
>> down and find out where in that method it's spending the most time.
> I agree with Mark -- a call graph would be very helpful in tracking down
> what's happening.
>
> There are background tasks that run. Are you running multisite? Do you
> have dynamic bucket resharding turned on? Are you using lifecycle? And
> garbage collection is another background task.
>
> And just to be clear -- sometimes all 3 of your rados gateways are
> simultaneously in this state?
>
> But the call graph would be incredibly helpful.
Thank you,
Eric
Re: [ceph-users] radosgw pegging down 5 CPU cores when no data is being transferred
On 8/21/19 10:22 AM, Mark Nelson wrote:
> Hi Vladimir,
>
> On 8/21/19 8:54 AM, Vladimir Brik wrote:
>> Hello
>> [much elided]
> You might want to try grabbing a callgraph from perf instead of just
> running perf top, or using my wallclock profiler, to see if you can drill
> down and find out where in that method it's spending the most time.

I agree with Mark -- a call graph would be very helpful in tracking down what's happening.

There are background tasks that run. Are you running multisite? Do you have dynamic bucket resharding turned on? Are you using lifecycle? And garbage collection is another background task.

And just to be clear -- sometimes all 3 of your rados gateways are simultaneously in this state?

But the call graph would be incredibly helpful.

Thank you,
Eric

--
J. Eric Ivancich
he/him/his
Red Hat Storage
Ann Arbor, Michigan, USA
Re: [ceph-users] radosgw pegging down 5 CPU cores when no data is being transferred
Correction: the number of threads stuck using 100% of a CPU core varies from 1 to 5 (it's not always 5).

Vlad

On 8/21/19 8:54 AM, Vladimir Brik wrote:
> Hello
>
> I am running a Ceph 14.2.1 cluster with 3 rados gateways. Periodically,
> the radosgw process on those machines starts consuming 100% of 5 CPU cores
> for days at a time, even though the machine is not being used for data
> transfers (nothing in radosgw logs, couple of KB/s of network).
>
> This situation can affect any number of our rados gateways, lasts from
> a few hours to a few days, and stops when the radosgw process is restarted
> or on its own.
>
> Does anybody have an idea what might be going on or how to debug it? I
> don't see anything obvious in the logs. Perf top is saying that CPU is
> consumed by the radosgw shared object in symbol get_obj_data::flush, which,
> if I interpret things correctly, is called from a symbol with a long
> name that contains the substring "boost9intrusive9list_impl"
>
> This is our configuration:
> rgw_frontends = civetweb num_threads=5000 port=443s
> ssl_certificate=/etc/ceph/rgw.crt
> error_log_file=/var/log/ceph/civetweb.error.log
>
> (the error log file doesn't exist)
>
> Thanks,
>
> Vlad
Re: [ceph-users] radosgw pegging down 5 CPU cores when no data is being transferred
On Wed, Aug 21, 2019 at 3:55 PM Vladimir Brik wrote:
>
> Hello
>
> I am running a Ceph 14.2.1 cluster with 3 rados gateways. Periodically,
> radosgw process on those machines starts consuming 100% of 5 CPU cores
> for days at a time, even though the machine is not being used for data
> transfers (nothing in radosgw logs, couple of KB/s of network).
>
> This situation can affect any number of our rados gateways, lasts from
> few hours to few days and stops if radosgw process is restarted or on
> its own.
>
> Does anybody have an idea what might be going on or how to debug it? I
> don't see anything obvious in the logs. Perf top is saying that CPU is
> consumed by radosgw shared object in symbol get_obj_data::flush, which,
> if I interpret things correctly, is called from a symbol with a long
> name that contains the substring "boost9intrusive9list_impl"
>
> This is our configuration:
> rgw_frontends = civetweb num_threads=5000 port=443s
> ssl_certificate=/etc/ceph/rgw.crt
> error_log_file=/var/log/ceph/civetweb.error.log

Probably unrelated to your problem, but running with lots of threads is usually an indicator that the async beast frontend would be a better fit for your setup. (But the code you see in perf should not be related to the frontend.)

Paul

> (error log file doesn't exist)
>
> Thanks,
>
> Vlad
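For readers wanting to try Paul's suggestion: a minimal ceph.conf sketch of the switch from civetweb to beast. This is an illustration only; the section name below is generic, the certificate path is simply reused from Vlad's civetweb config, and option names should be checked against the radosgw frontend documentation for your release (beast gained SSL support in Nautilus).

```ini
[client.rgw]
# beast is asynchronous, so it does not need a civetweb-style
# num_threads=5000 thread pool
rgw_frontends = beast ssl_port=443 ssl_certificate=/etc/ceph/rgw.crt
```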
Re: [ceph-users] radosgw pegging down 5 CPU cores when no data is being transferred
Hi Vladimir,

On 8/21/19 8:54 AM, Vladimir Brik wrote:
> Hello
>
> I am running a Ceph 14.2.1 cluster with 3 rados gateways. Periodically,
> radosgw process on those machines starts consuming 100% of 5 CPU cores
> for days at a time, even though the machine is not being used for data
> transfers (nothing in radosgw logs, couple of KB/s of network).
>
> This situation can affect any number of our rados gateways, lasts from
> few hours to few days and stops if radosgw process is restarted or on
> its own.
>
> Does anybody have an idea what might be going on or how to debug it? I
> don't see anything obvious in the logs. Perf top is saying that CPU is
> consumed by radosgw shared object in symbol get_obj_data::flush, which,
> if I interpret things correctly, is called from a symbol with a long
> name that contains the substring "boost9intrusive9list_impl"

I don't normally look at the RGW code, so maybe Matt/Casey/Eric can chime in. That code is in src/rgw/rgw_rados.cc in the get_obj_data struct. The flush method does some sorting/merging and then walks through a list of completed IOs, appears to copy a bufferlist out of each one, then deletes it from the list and passes the BL off to client_cb->handle_data. Looks like it could be pretty CPU intensive, but if you are seeing that much CPU for that long, it sounds like something is rather off.

You might want to try grabbing a callgraph from perf instead of just running perf top, or using my wallclock profiler, to see if you can drill down and find out where in that method it's spending the most time.
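To make the shape of that code path concrete, here is a loose Python sketch of the pattern described above. This is an illustration of the idea only, not the actual C++ in src/rgw/rgw_rados.cc; `AioResult` and `handle_data` are stand-ins for the real rgw types (rgw::AioResultEntry, client_cb->handle_data):

```python
class AioResult:
    """Stand-in for one completed async read: an offset plus its buffer."""
    def __init__(self, offset, data):
        self.offset = offset
        self.data = data

def flush(completed, next_offset, handle_data):
    """Sort the completed IOs, then hand contiguous in-order buffers to the client.

    Mirrors the described pattern: a sort/merge step (the boost::intrusive
    list sort visible in the profile), then a walk over the list that copies
    each buffer out, deletes the entry, and passes the buffer to the client
    callback. If the bookkeeping ever stops making progress -- next_offset
    never advances because an expected completion is missing -- callers that
    keep re-entering this path would spin, which is one way to read the
    "stuck in get_obj_data::flush" symptom.
    """
    completed.sort(key=lambda r: r.offset)
    while completed and completed[0].offset == next_offset:
        r = completed.pop(0)          # delete the entry from the list
        handle_data(r.data)           # hand the buffer to the client callback
        next_offset += len(r.data)
    return next_offset
```

For example, completions arriving out of order (offset 4 before offset 0) are still delivered in offset order, while a gap (next expected offset missing) delivers nothing and leaves the offset unchanged until the missing chunk completes.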
My wallclock profiler is here: https://github.com/markhpc/gdbpmp

Mark

> This is our configuration:
> rgw_frontends = civetweb num_threads=5000 port=443s
> ssl_certificate=/etc/ceph/rgw.crt
> error_log_file=/var/log/ceph/civetweb.error.log
>
> (error log file doesn't exist)
>
> Thanks,
>
> Vlad