Re: [ceph-users] radosgw pegging down 5 CPU cores when no data is being transferred

2019-10-10 Thread Paul Emmerich
I've also encountered this issue on a cluster yesterday; one CPU got
stuck in an infinite loop in get_obj_data::flush and it stopped
serving requests. I've updated the tracker issue accordingly.
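
In case it helps anyone hitting the same thing: a quick way to confirm which
thread is spinning (assuming a normal Linux box; adjust the PID/daemon name to
your setup) is something like

top -H -p $(pidof radosgw)    # per-thread view; note the TID pegging a core
gdb -p $(pidof radosgw)       # attaching briefly pauses the daemon
(gdb) thread apply all bt

and then matching the busy TID against the backtraces to see whether it is
sitting in get_obj_data::flush.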


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Wed, Aug 21, 2019 at 3:55 PM Vladimir Brik wrote:
>
> Hello
>
> I am running a Ceph 14.2.1 cluster with 3 rados gateways. Periodically,
> radosgw process on those machines starts consuming 100% of 5 CPU cores
> for days at a time, even though the machine is not being used for data
> transfers (nothing in radosgw logs, couple of KB/s of network).
>
> This situation can affect any number of our rados gateways, lasts from
> few hours to few days and stops if radosgw process is restarted or on
> its own.
>
> Does anybody have an idea what might be going on or how to debug it? I
> don't see anything obvious in the logs. Perf top is saying that CPU is
> consumed by radosgw shared object in symbol get_obj_data::flush, which,
> if I interpret things correctly, is called from a symbol with a long
> name that contains the substring "boost9intrusive9list_impl"
>
> This is our configuration:
> rgw_frontends = civetweb num_threads=5000 port=443s
> ssl_certificate=/etc/ceph/rgw.crt
> error_log_file=/var/log/ceph/civetweb.error.log
>
> (error log file doesn't exist)
>
>
> Thanks,
>
> Vlad


Re: [ceph-users] radosgw pegging down 5 CPU cores when no data is being transferred

2019-08-26 Thread Vladimir Brik

I created a ticket: https://tracker.ceph.com/issues/41511

Note that I think I was mistaken when I said that sometimes the problem 
goes away on its own. I've looked back through our monitoring and it 
looks like when the problem did go away, it was because either the 
machine was rebooted or the radosgw service was restarted.



Vlad



On 8/23/19 10:17 AM, Eric Ivancich wrote:

Good morning, Vladimir,

Please create a tracker for this 
(https://tracker.ceph.com/projects/rgw/issues/new) and include the link 
to it in an email reply. And if you can include any more potentially 
relevant details, please do so. I’ll add my initial analysis to it.


But the threads do seem to be stuck, at least for a while, in 
get_obj_data::flush despite a lack of traffic. And sometimes it 
self-resolves, so it’s not a true “infinite loop”.


Thank you,

Eric

On Aug 22, 2019, at 9:12 PM, Eric Ivancich <ivanc...@redhat.com> wrote:


Thank you for providing the profiling data, Vladimir. There are 5078 
threads and most of them are waiting. Here is a list of the deepest 
call of each thread with duplicates removed.


            + 100.00% epoll_wait
                          + 100.00% get_obj_data::flush(rgw::OwningList<rgw::AioResultEntry>&&)
            + 100.00% poll
        + 100.00% poll
      + 100.00% poll
        + 100.00% pthread_cond_timedwait@@GLIBC_2.3.2
      + 100.00% pthread_cond_timedwait@@GLIBC_2.3.2
        + 100.00% pthread_cond_wait@@GLIBC_2.3.2
      + 100.00% pthread_cond_wait@@GLIBC_2.3.2
      + 100.00% read
                            + 100.00% _ZN5boost9intrusive9list_implINS0_8bhtraitsIN3rgw14AioResultEntryENS0_16list_node_traitsIPvEELNS0_14link_mode_typeE1ENS0_7dft_tagELj1EEEmLb1EvE4sortIZN12get_obj_data5flushEONS3_10OwningListIS4_JUlRKT_RKT0_E_EEvSH_


The only interesting ones are the second and last:

* get_obj_data::flush(rgw::OwningList<rgw::AioResultEntry>&&)
* _ZN5boost9intrusive9list_implINS0_8bhtraitsIN3rgw14AioResultEntryENS0_16list_node_traitsIPvEELNS0_14link_mode_typeE1ENS0_7dft_tagELj1EEEmLb1EvE4sortIZN12get_obj_data5flushEONS3_10OwningListIS4_JUlRKT_RKT0_E_EEvSH_


They are essentially part of the same call stack that results from 
processing a GetObj request, and five threads are in this call stack 
(the only difference is whether or not they include the call into the 
boost intrusive list). Here’s the full call stack of those threads:


+ 100.00% clone
  + 100.00% start_thread
    + 100.00% worker_thread
      + 100.00% process_new_connection
        + 100.00% handle_request
          + 100.00% RGWCivetWebFrontend::process(mg_connection*)
            + 100.00% process_request(RGWRados*, RGWREST*, RGWRequest*, std::string const&, rgw::auth::StrategyRegistry const&, RGWRestfulIO*, OpsLogSocket*, optional_yield, rgw::dmclock::Scheduler*, int*)
              + 100.00% rgw_process_authenticated(RGWHandler_REST*, RGWOp*&, RGWRequest*, req_state*, bool)
                + 100.00% RGWGetObj::execute()
                  + 100.00% RGWRados::Object::Read::iterate(long, long, RGWGetDataCB*)
                    + 100.00% RGWRados::iterate_obj(RGWObjectCtx&, RGWBucketInfo const&, rgw_obj const&, long, long, unsigned long, int (*)(rgw_raw_obj const&, long, long, long, bool, RGWObjState*, void*), void*)
                      + 100.00% _get_obj_iterate_cb(rgw_raw_obj const&, long, long, long, bool, RGWObjState*, void*)
                        + 100.00% RGWRados::get_obj_iterate_cb(rgw_raw_obj const&, long, long, long, bool, RGWObjState*, void*)
                          + 100.00% get_obj_data::flush(rgw::OwningList<rgw::AioResultEntry>&&)
                            + 100.00% _ZN5boost9intrusive9list_implINS0_8bhtraitsIN3rgw14AioResultEntryENS0_16list_node_traitsIPvEELNS0_14link_mode_typeE1ENS0_7dft_tagELj1EEEmLb1EvE4sortIZN12get_obj_data5flushEONS3_10OwningListIS4_JUlRKT_RKT0_E_EEvSH_


So this isn’t background processing but request processing. I’m not 
clear why these requests are consuming so much CPU for so long.


From your initial message:
I am running a Ceph 14.2.1 cluster with 3 rados gateways. 
Periodically, radosgw process on those machines starts consuming 100% 
of 5 CPU cores for days at a time, even though the machine is not 
being used for data transfers (nothing in radosgw logs, couple of 
KB/s of network).


This situation can affect any number of our rados gateways, lasts 
from few hours to few days and stops if radosgw process is restarted 
or on its own.


I’m going to check with others who’re more familiar with this code path.


Begin forwarded message:

From: Vladimir Brik <vladimir.b...@icecube.wisc.edu>
Subject: Re: [ceph-users] radosgw pegging down 5 CPU cores when no 
data is being transferred
Date: August 21, 2019 at 4:47:01 PM EDT
To: "J. Eric Ivancich" <ivanc...@redhat.com>, Mark Nelson 
<mnel...@redhat.com>, ceph-users@lists.ceph.com

Re: [ceph-users] radosgw pegging down 5 CPU cores when no data is being transferred

2019-08-23 Thread Eric Ivancich
Good morning, Vladimir,

Please create a tracker for this 
(https://tracker.ceph.com/projects/rgw/issues/new) and include the link to it 
in an email reply. And if you can include any more potentially relevant 
details, please do so. I’ll add my initial analysis to it.

But the threads do seem to be stuck, at least for a while, in 
get_obj_data::flush despite a lack of traffic. And sometimes it self-resolves, 
so it’s not a true “infinite loop”.

Thank you,

Eric

> On Aug 22, 2019, at 9:12 PM, Eric Ivancich  wrote:
> 
> Thank you for providing the profiling data, Vladimir. There are 5078 threads 
> and most of them are waiting. Here is a list of the deepest call of each 
> thread with duplicates removed.
> 
> + 100.00% epoll_wait
>   + 100.00% get_obj_data::flush(rgw::OwningList<rgw::AioResultEntry>&&)
> + 100.00% poll
> + 100.00% poll
>   + 100.00% poll
> + 100.00% pthread_cond_timedwait@@GLIBC_2.3.2
>   + 100.00% pthread_cond_timedwait@@GLIBC_2.3.2
> + 100.00% pthread_cond_wait@@GLIBC_2.3.2
>   + 100.00% pthread_cond_wait@@GLIBC_2.3.2
>   + 100.00% read
> + 100.00% _ZN5boost9intrusive9list_implINS0_8bhtraitsIN3rgw14AioResultEntryENS0_16list_node_traitsIPvEELNS0_14link_mode_typeE1ENS0_7dft_tagELj1EEEmLb1EvE4sortIZN12get_obj_data5flushEONS3_10OwningListIS4_JUlRKT_RKT0_E_EEvSH_
> 
> The only interesting ones are the second and last:
> 
> * get_obj_data::flush(rgw::OwningList<rgw::AioResultEntry>&&)
> * _ZN5boost9intrusive9list_implINS0_8bhtraitsIN3rgw14AioResultEntryENS0_16list_node_traitsIPvEELNS0_14link_mode_typeE1ENS0_7dft_tagELj1EEEmLb1EvE4sortIZN12get_obj_data5flushEONS3_10OwningListIS4_JUlRKT_RKT0_E_EEvSH_
> 
> They are essentially part of the same call stack that results from processing 
> a GetObj request, and five threads are in this call stack (the only 
> difference is whether or not they include the call into the boost intrusive 
> list). Here’s the full call stack of those threads:
> 
> + 100.00% clone
>   + 100.00% start_thread
>     + 100.00% worker_thread
>       + 100.00% process_new_connection
>         + 100.00% handle_request
>           + 100.00% RGWCivetWebFrontend::process(mg_connection*)
>             + 100.00% process_request(RGWRados*, RGWREST*, RGWRequest*, std::string const&, rgw::auth::StrategyRegistry const&, RGWRestfulIO*, OpsLogSocket*, optional_yield, rgw::dmclock::Scheduler*, int*)
>               + 100.00% rgw_process_authenticated(RGWHandler_REST*, RGWOp*&, RGWRequest*, req_state*, bool)
>                 + 100.00% RGWGetObj::execute()
>                   + 100.00% RGWRados::Object::Read::iterate(long, long, RGWGetDataCB*)
>                     + 100.00% RGWRados::iterate_obj(RGWObjectCtx&, RGWBucketInfo const&, rgw_obj const&, long, long, unsigned long, int (*)(rgw_raw_obj const&, long, long, long, bool, RGWObjState*, void*), void*)
>                       + 100.00% _get_obj_iterate_cb(rgw_raw_obj const&, long, long, long, bool, RGWObjState*, void*)
>                         + 100.00% RGWRados::get_obj_iterate_cb(rgw_raw_obj const&, long, long, long, bool, RGWObjState*, void*)
>                           + 100.00% get_obj_data::flush(rgw::OwningList<rgw::AioResultEntry>&&)
>                             + 100.00% _ZN5boost9intrusive9list_implINS0_8bhtraitsIN3rgw14AioResultEntryENS0_16list_node_traitsIPvEELNS0_14link_mode_typeE1ENS0_7dft_tagELj1EEEmLb1EvE4sortIZN12get_obj_data5flushEONS3_10OwningListIS4_JUlRKT_RKT0_E_EEvSH_
> 
> So this isn’t background processing but request processing. I’m not clear why 
> these requests are consuming so much CPU for so long.
> 
> From your initial message:
>> I am running a Ceph 14.2.1 cluster with 3 rados gateways. Periodically, 
>> radosgw process on those machines starts consuming 100% of 5 CPU cores for 
>> days at a time, even though the machine is not being used for data transfers 
>> (nothing in radosgw logs, couple of KB/s of network).
>> 
>> This situation can affect any number of our rados gateways, lasts from few 
>> hours to few days and stops if radosgw process is restarted or on its own.
> 
> 
> I’m going to check with others who’re more familiar with this code path.
> 
>> Begin forwarded message:
>> 
>> From: Vladimir Brik <vladimir.b...@icecube.wisc.edu>
>> Subject: Re: [ceph-users] radosgw pegging down 5 CPU cores when no data is 
>> being transferred
>> Date: August 21, 2019 at 4:47:01 PM EDT
>> To: "J. Eric Ivancich" <ivanc...@redhat.com>, Mark Nelson 
>> <mnel...@redhat.com>, ceph-users@lists.ceph.com

Re: [ceph-users] radosgw pegging down 5 CPU cores when no data is being transferred

2019-08-21 Thread Vladimir Brik

> Are you running multisite?
No

> Do you have dynamic bucket resharding turned on?
Yes. "radosgw-admin reshard list" prints "[]"

> Are you using lifecycle?
I am not sure. How can I check? "radosgw-admin lc list" says "[]"

> And just to be clear -- sometimes all 3 of your rados gateways are
> simultaneously in this state?
Multiple, but I have not seen all 3 being in this state simultaneously. 
Currently one gateway has 1 thread using 100% of CPU, and another has 5 
threads each using 100% CPU.
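
For completeness, these are the checks I ran, plus a couple more that I 
believe cover the same background tasks (instance names and exact option 
syntax will differ per setup):

radosgw-admin realm list; radosgw-admin zonegroup list   # multisite configuration
radosgw-admin reshard list                               # pending dynamic reshard jobs
radosgw-admin lc list                                    # configured lifecycle policies
radosgw-admin gc list --include-all                      # pending garbage collection
ceph daemon client.rgw.<instance> config get rgw_dynamic_resharding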


Here are the fruits of my attempts to capture the call graph using perf 
and gdbpmp:

https://icecube.wisc.edu/~vbrik/perf.data
https://icecube.wisc.edu/~vbrik/gdbpmp.data

These are the commands that I ran and their outputs (note that I couldn't 
get perf to run without generating the warning):

rgw-3 gdbpmp # ./gdbpmp.py -n 100 -p 73688 -o gdbpmp.data
Attaching to process 73688...Done.
Gathering Samples...

Profiling complete with 100 samples.

rgw-3 ~ # perf record --call-graph fp -p 73688 -- sleep 10
[ perf record: Woken up 54 times to write data ]
Warning:
Processed 574207 events and lost 4 chunks!
Check IO/CPU overload!
[ perf record: Captured and wrote 58.866 MB perf.data (233750 samples) ]
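
In case it's useful, the captures can be browsed locally with something 
along these lines (the FlameGraph scripts are from 
https://github.com/brendangregg/FlameGraph, and the gdbpmp invocation for 
loading a capture is from memory, so check its --help):

perf report -g --no-children -i perf.data
perf script -i perf.data | ./stackcollapse-perf.pl | ./flamegraph.pl > radosgw.svg
./gdbpmp.py -i gdbpmp.data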





Vlad



On 8/21/19 11:16 AM, J. Eric Ivancich wrote:

On 8/21/19 10:22 AM, Mark Nelson wrote:

Hi Vladimir,


On 8/21/19 8:54 AM, Vladimir Brik wrote:

Hello



[much elided]


You might want to try grabbing a call graph from perf instead of just
running perf top, or using my wallclock profiler, to see if you can drill
down and find out where in that method it's spending the most time.


I agree with Mark -- a call graph would be very helpful in tracking down
what's happening.

There are background tasks that run. Are you running multisite? Do you
have dynamic bucket resharding turned on? Are you using lifecycle? And
garbage collection is another background task.

And just to be clear -- sometimes all 3 of your rados gateways are
simultaneously in this state?

But the call graph would be incredibly helpful.

Thank you,

Eric




Re: [ceph-users] radosgw pegging down 5 CPU cores when no data is being transferred

2019-08-21 Thread J. Eric Ivancich
On 8/21/19 10:22 AM, Mark Nelson wrote:
> Hi Vladimir,
> 
> 
> On 8/21/19 8:54 AM, Vladimir Brik wrote:
>> Hello
>>

[much elided]

> You might want to try grabbing a call graph from perf instead of just
> running perf top, or using my wallclock profiler, to see if you can drill
> down and find out where in that method it's spending the most time.

I agree with Mark -- a call graph would be very helpful in tracking down
what's happening.

There are background tasks that run. Are you running multisite? Do you
have dynamic bucket resharding turned on? Are you using lifecycle? And
garbage collection is another background task.

And just to be clear -- sometimes all 3 of your rados gateways are
simultaneously in this state?

But the call graph would be incredibly helpful.

Thank you,

Eric

-- 
J. Eric Ivancich
he/him/his
Red Hat Storage
Ann Arbor, Michigan, USA


Re: [ceph-users] radosgw pegging down 5 CPU cores when no data is being transferred

2019-08-21 Thread Vladimir Brik
Correction: the number of threads stuck using 100% of a CPU core varies 
from 1 to 5 (it's not always 5)


Vlad

On 8/21/19 8:54 AM, Vladimir Brik wrote:

Hello

I am running a Ceph 14.2.1 cluster with 3 rados gateways. Periodically, 
radosgw process on those machines starts consuming 100% of 5 CPU cores 
for days at a time, even though the machine is not being used for data 
transfers (nothing in radosgw logs, couple of KB/s of network).


This situation can affect any number of our rados gateways, lasts from 
few hours to few days and stops if radosgw process is restarted or on 
its own.


Does anybody have an idea what might be going on or how to debug it? I 
don't see anything obvious in the logs. Perf top is saying that CPU is 
consumed by radosgw shared object in symbol get_obj_data::flush, which, 
if I interpret things correctly, is called from a symbol with a long 
name that contains the substring "boost9intrusive9list_impl"


This is our configuration:
rgw_frontends = civetweb num_threads=5000 port=443s 
ssl_certificate=/etc/ceph/rgw.crt 
error_log_file=/var/log/ceph/civetweb.error.log


(error log file doesn't exist)


Thanks,

Vlad


Re: [ceph-users] radosgw pegging down 5 CPU cores when no data is being transferred

2019-08-21 Thread Paul Emmerich
On Wed, Aug 21, 2019 at 3:55 PM Vladimir Brik wrote:
>
> Hello
>
> I am running a Ceph 14.2.1 cluster with 3 rados gateways. Periodically,
> radosgw process on those machines starts consuming 100% of 5 CPU cores
> for days at a time, even though the machine is not being used for data
> transfers (nothing in radosgw logs, couple of KB/s of network).
>
> This situation can affect any number of our rados gateways, lasts from
> few hours to few days and stops if radosgw process is restarted or on
> its own.
>
> Does anybody have an idea what might be going on or how to debug it? I
> don't see anything obvious in the logs. Perf top is saying that CPU is
> consumed by radosgw shared object in symbol get_obj_data::flush, which,
> if I interpret things correctly, is called from a symbol with a long
> name that contains the substring "boost9intrusive9list_impl"
>
> This is our configuration:
> rgw_frontends = civetweb num_threads=5000 port=443s
> ssl_certificate=/etc/ceph/rgw.crt
> error_log_file=/var/log/ceph/civetweb.error.log

Probably unrelated to your problem, but running with lots of threads
is usually an indicator that the async beast frontend would be a
better fit for your setup.
(But the code you see in perf should not be related to the frontend)
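
For reference, a beast equivalent of the config above would, I think, look
roughly like:

rgw_frontends = beast ssl_port=443 ssl_certificate=/etc/ceph/rgw.crt

beast manages its own thread pool, so there is no num_threads knob, and it
has no error_log_file option; if the key is in a separate file,
ssl_private_key= can be added.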


Paul

>
> (error log file doesn't exist)
>
>
> Thanks,
>
> Vlad


Re: [ceph-users] radosgw pegging down 5 CPU cores when no data is being transferred

2019-08-21 Thread Mark Nelson

Hi Vladimir,


On 8/21/19 8:54 AM, Vladimir Brik wrote:

Hello

I am running a Ceph 14.2.1 cluster with 3 rados gateways. 
Periodically, radosgw process on those machines starts consuming 100% 
of 5 CPU cores for days at a time, even though the machine is not 
being used for data transfers (nothing in radosgw logs, couple of KB/s 
of network).


This situation can affect any number of our rados gateways, lasts from 
few hours to few days and stops if radosgw process is restarted or on 
its own.


Does anybody have an idea what might be going on or how to debug it? I 
don't see anything obvious in the logs. Perf top is saying that CPU is 
consumed by radosgw shared object in symbol get_obj_data::flush, 
which, if I interpret things correctly, is called from a symbol with a 
long name that contains the substring "boost9intrusive9list_impl"



I don't normally look at the RGW code so maybe Matt/Casey/Eric can chime 
in.  That code is in src/rgw/rgw_rados.cc in the get_obj_data struct.  
The flush method does some sorting/merging and then walks through a 
list of completed IOs and appears to copy a bufferlist out of each 
one, then deletes it from the list and passes the BL off to 
client_cb->handle_data.  Looks like it could be pretty CPU intensive, but 
if you are seeing that much CPU for that long it sounds like something 
is rather off.
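
To give a feel for what flush is doing, here is a simplified, stand-alone 
sketch of that sort-then-drain pattern (stand-in types, not the actual 
Ceph code):

// Simplified sketch of the sort-then-drain pattern described above.
// Stand-in types; not the real rgw classes.
#include <cstdint>
#include <functional>
#include <list>
#include <string>
#include <utility>

struct AioResultEntry {            // stand-in for rgw::AioResultEntry
  uint64_t offset = 0;             // object offset this completed IO covers
  std::string data;                // stand-in for a ceph::bufferlist
};

struct ClientCB {                  // stand-in for the client callback object
  std::function<int(const std::string&, uint64_t)> handle_data;
};

struct get_obj_data_sketch {
  ClientCB* client_cb = nullptr;
  uint64_t next_offset = 0;        // next offset the client expects

  // Deliver completed IOs to the client in offset order, stopping at the
  // first gap (an IO that has not completed yet).
  int flush(std::list<AioResultEntry>&& completed) {
    completed.sort([](const AioResultEntry& a, const AioResultEntry& b) {
      return a.offset < b.offset;
    });
    while (!completed.empty() && completed.front().offset == next_offset) {
      AioResultEntry entry = std::move(completed.front());
      completed.pop_front();                      // delete it from the list
      next_offset += entry.data.size();
      int r = client_cb->handle_data(entry.data, entry.offset);  // hand BL to client
      if (r < 0) {
        return r;                                 // propagate client error
      }
    }
    return 0;
  }
};

If a loop like that keeps getting re-entered with entries that never satisfy 
the delivery condition, or keeps re-sorting a list that never drains, it 
could plausibly spin at 100% CPU without making progress, which would match 
what perf is showing.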



You might want to try grabbing a call graph from perf instead of just 
running perf top, or using my wallclock profiler, to see if you can drill 
down and find out where in that method it's spending the most time.



My wallclock profiler is here:


https://github.com/markhpc/gdbpmp


Mark




This is our configuration:
rgw_frontends = civetweb num_threads=5000 port=443s 
ssl_certificate=/etc/ceph/rgw.crt 
error_log_file=/var/log/ceph/civetweb.error.log


(error log file doesn't exist)


Thanks,

Vlad