Correction: the number of threads stuck using 100% of a CPU core varies from 1 to 5 (it's not always 5).

Vlad

On 8/21/19 8:54 AM, Vladimir Brik wrote:
Hello

I am running a Ceph 14.2.1 cluster with 3 rados gateways. Periodically, the radosgw process on those machines starts consuming 100% of 5 CPU cores for days at a time, even though the machine is not being used for data transfers (nothing in the radosgw logs, a couple of KB/s of network traffic).

This situation can affect any number of our rados gateways. It lasts from a few hours to a few days and stops either on its own or when the radosgw process is restarted.

Does anybody have an idea what might be going on or how to debug it? I don't see anything obvious in the logs. perf top shows the CPU time being spent in the radosgw shared object, in the symbol get_obj_data::flush, which, if I interpret things correctly, is called from a symbol with a long name that contains the substring "boost9intrusive9list_impl".
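
One way to pin down which threads are spinning (and how many, since the count varies) is to sample per-thread CPU time from /proc. The following is only a minimal sketch, assuming a Linux host with the standard /proc/<pid>/task/<tid>/stat layout; the function names, the 5-second interval and the 0.9 threshold are all made up for illustration and are not part of Ceph or any of its tools.

```python
import os
import time


def parse_cpu_ticks(stat_line):
    """Return utime + stime (in clock ticks) from one task stat line."""
    # The comm field is wrapped in parentheses and may contain spaces,
    # so cut at the last ')' before splitting into fields.
    rest = stat_line.rpartition(")")[2].split()
    # rest[0] is field 3 (state); utime and stime are fields 14 and 15.
    return int(rest[11]) + int(rest[12])


def sample_threads(pid):
    """Return {tid: cpu_ticks} for every thread of the given process."""
    ticks = {}
    for tid in os.listdir(f"/proc/{pid}/task"):
        try:
            with open(f"/proc/{pid}/task/{tid}/stat") as f:
                ticks[int(tid)] = parse_cpu_ticks(f.read())
        except (FileNotFoundError, ProcessLookupError):
            pass  # the thread exited between listdir() and open()
    return ticks


def busy_threads(before, after, interval, hz, threshold=0.9):
    """Return {tid: fraction of one core used} for threads >= threshold."""
    busy = {}
    for tid, ticks in after.items():
        if tid in before:
            share = (ticks - before[tid]) / (hz * interval)
            if share >= threshold:
                busy[tid] = share
    return busy


def report(pid, interval=5.0):
    """Sample twice, `interval` seconds apart, and print the busy threads."""
    hz = os.sysconf("SC_CLK_TCK")  # clock ticks per second
    before = sample_threads(pid)
    time.sleep(interval)
    after = sample_threads(pid)
    for tid, share in sorted(busy_threads(before, after, interval, hz).items()):
        print(f"tid {tid}: {share:.0%} of one core")
```

Calling report() with the radosgw PID would list the thread IDs that stayed near 100% of a core over the interval. Those IDs can then be matched against the thread list of a debugger or profiler attached to the same process.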

This is our configuration:
rgw_frontends = civetweb num_threads=5000 port=443s ssl_certificate=/etc/ceph/rgw.crt error_log_file=/var/log/ceph/civetweb.error.log

(The error log file doesn't exist.)


Thanks,

Vlad
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
