On 8/27/19 11:49 PM, Bryan Stillwell wrote:
> We've run into a problem this afternoon on our test cluster, which is
> running Nautilus (14.2.2). It seems that any time PGs move on the cluster
> (from marking an OSD down, setting the primary-affinity to 0, or using the
> balancer), a large number of the OSDs in the cluster peg the CPU cores
> they're running on for a while, which causes slow requests. From what I can
> tell, it appears to be related to slow peering caused by osd_pg_create()
> taking a long time.
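
(For context, PG movement of the kind described here is typically triggered
with standard commands like the following; OSD id 3 is used purely as an
illustrative example:

    # mark an OSD down so its PGs re-peer on other OSDs
    ceph osd down 3

    # stop an OSD from acting as primary, which moves primaries elsewhere
    ceph osd primary-affinity osd.3 0

    # let the balancer start moving PGs on its own
    ceph balancer on

All of these generate new OSD maps and a round of peering, which is where the
slow osd_pg_create handling described above shows up.)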
>
> This was seen on quite a few OSDs while waiting for peering to complete:
>
> # ceph daemon osd.3 ops
> {
>     "ops": [
>         {
>             "description": "osd_pg_create(e179061 287.7a:177739 287.9a:177739 287.e2:177739 287.e7:177739 287.f6:177739 287.187:177739 287.1aa:177739 287.216:177739 287.306:177739 287.3e6:177739)",
>             "initiated_at": "2019-08-27 14:34:46.556413",
>             "age": 318.25234538000001,
>             "duration": 318.25241895300002,
>             "type_data": {
>                 "flag_point": "started",
>                 "events": [
>                     {
>                         "time": "2019-08-27 14:34:46.556413",
>                         "event": "initiated"
>                     },
>                     {
>                         "time": "2019-08-27 14:34:46.556413",
>                         "event": "header_read"
>                     },
>                     {
>                         "time": "2019-08-27 14:34:46.556299",
>                         "event": "throttled"
>                     },
>                     {
>                         "time": "2019-08-27 14:34:46.556456",
>                         "event": "all_read"
>                     },
>                     {
>                         "time": "2019-08-27 14:35:12.456901",
>                         "event": "dispatched"
>                     },
>                     {
>                         "time": "2019-08-27 14:35:12.456903",
>                         "event": "wait for new map"
>                     },
>                     {
>                         "time": "2019-08-27 14:40:01.292346",
>                         "event": "started"
>                     }
>                 ]
>             }
>         },
>         ...snip...
>         {
>             "description": "osd_pg_create(e179066 287.7a:177739 287.9a:177739 287.e2:177739 287.e7:177739 287.f6:177739 287.187:177739 287.1aa:177739 287.216:177739 287.306:177739 287.3e6:177739)",
>             "initiated_at": "2019-08-27 14:35:09.908567",
>             "age": 294.900191001,
>             "duration": 294.90068416899999,
>             "type_data": {
>                 "flag_point": "delayed",
>                 "events": [
>                     {
>                         "time": "2019-08-27 14:35:09.908567",
>                         "event": "initiated"
>                     },
>                     {
>                         "time": "2019-08-27 14:35:09.908567",
>                         "event": "header_read"
>                     },
>                     {
>                         "time": "2019-08-27 14:35:09.908520",
>                         "event": "throttled"
>                     },
>                     {
>                         "time": "2019-08-27 14:35:09.908617",
>                         "event": "all_read"
>                     },
>                     {
>                         "time": "2019-08-27 14:35:12.456921",
>                         "event": "dispatched"
>                     },
>                     {
>                         "time": "2019-08-27 14:35:12.456923",
>                         "event": "wait for new map"
>                     }
>                 ]
>             }
>         }
>     ],
>     "num_ops": 6
> }
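
For what it's worth, a quick way to see how many OSDs on a host are sitting on
stuck osd_pg_create ops is to walk the admin sockets and filter the same "ops"
output shown above. A rough sketch, assuming the default /var/run/ceph/ socket
location and that jq is available:

    for sock in /var/run/ceph/ceph-osd.*.asok; do
        echo "== $sock =="
        # same data as 'ceph daemon osd.N ops', addressed via the socket path
        ceph --admin-daemon "$sock" ops | \
            jq -r '.ops[]
                   | select(.description | startswith("osd_pg_create"))
                   | "\(.age | floor)s  \(.type_data.flag_point)"'
    done

Note how the first op above sits in "wait for new map" from 14:35:12 until
14:40:01 before it reaches "started", which is almost five minutes.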
>
>
> That "wait for new map" message made us think something was getting hung up
> on the monitors, so we restarted them all without any luck.
>
> I'll keep investigating, but so far my Google searches aren't pulling
> anything up, so I wanted to see if anyone else has run into this.
>
I've seen this twice now on a ~1400 OSD cluster running Nautilus.
I created a bug report for this: https://tracker.ceph.com/issues/44184
Did you make any progress on this or run into it a second time?
Wido
> Thanks,
> Bryan
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io