I've also seen this problem once on Nautilus, with no obvious reason for
the slowness.
In my case it was a rather old cluster that had been upgraded all the way
from Firefly.
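
In case it helps anyone compare notes, here is a rough sketch for finding
where the time goes in an ops dump like the one quoted below. It assumes the
JSON shape and timestamp layout shown in that dump (other releases may format
timestamps differently). The helper names longest_stall and summarize are
made up for illustration; they are not part of any Ceph tooling.

```python
import json
from datetime import datetime

# Timestamp layout as it appears in the quoted dump (assumption: other
# Ceph releases may print a different format).
TIME_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def longest_stall(op):
    """Return (from_event, to_event, seconds) for the largest gap between
    consecutive events of one op, or None if it has fewer than two events."""
    events = op["type_data"]["events"]
    best = None
    for prev, cur in zip(events, events[1:]):
        gap = (
            datetime.strptime(cur["time"], TIME_FORMAT)
            - datetime.strptime(prev["time"], TIME_FORMAT)
        ).total_seconds()
        if best is None or gap > best[2]:
            best = (prev["event"], cur["event"], gap)
    return best


def summarize(dump_text, prefix="osd_pg_create"):
    """List (age, flag_point, longest_stall) for ops whose description
    starts with `prefix`, oldest op first."""
    dump = json.loads(dump_text)
    rows = []
    for op in dump.get("ops", []):
        if op.get("description", "").startswith(prefix):
            rows.append(
                (op["age"], op["type_data"]["flag_point"], longest_stall(op))
            )
    return sorted(rows, key=lambda row: row[0], reverse=True)
```

To try it, save the output of "ceph daemon osd.3 ops" to a file and pass the
file's contents to summarize(). Events are compared in the order listed, so
a slightly out-of-order timestamp (like "throttled" in the quoted dump) just
yields a small negative gap and is never picked as the longest stall.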


-- 
Paul Emmerich

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Tue, Feb 18, 2020 at 5:52 PM Wido den Hollander <w...@42on.com> wrote:
>
> On 8/27/19 11:49 PM, Bryan Stillwell wrote:
> > This afternoon we ran into a problem on our test cluster, which is
> > running Nautilus (14.2.2).  It seems that any time PGs move on the cluster
> > (from marking an OSD down, setting the primary-affinity to 0, or by using
> > the balancer), a large number of the OSDs in the cluster peg the CPU cores
> > they're running on for a while, which causes slow requests.  From what I can
> > tell, it appears to be related to slow peering caused by osd_pg_create()
> > taking a long time.
> >
> > This was seen on quite a few OSDs while waiting for peering to complete:
> >
> > # ceph daemon osd.3 ops
> > {
> >     "ops": [
> >         {
> >             "description": "osd_pg_create(e179061 287.7a:177739 287.9a:177739 287.e2:177739 287.e7:177739 287.f6:177739 287.187:177739 287.1aa:177739 287.216:177739 287.306:177739 287.3e6:177739)",
> >             "initiated_at": "2019-08-27 14:34:46.556413",
> >             "age": 318.25234538000001,
> >             "duration": 318.25241895300002,
> >             "type_data": {
> >                 "flag_point": "started",
> >                 "events": [
> >                     {
> >                         "time": "2019-08-27 14:34:46.556413",
> >                         "event": "initiated"
> >                     },
> >                     {
> >                         "time": "2019-08-27 14:34:46.556413",
> >                         "event": "header_read"
> >                     },
> >                     {
> >                         "time": "2019-08-27 14:34:46.556299",
> >                         "event": "throttled"
> >                     },
> >                     {
> >                         "time": "2019-08-27 14:34:46.556456",
> >                         "event": "all_read"
> >                     },
> >                     {
> >                         "time": "2019-08-27 14:35:12.456901",
> >                         "event": "dispatched"
> >                     },
> >                     {
> >                         "time": "2019-08-27 14:35:12.456903",
> >                         "event": "wait for new map"
> >                     },
> >                     {
> >                         "time": "2019-08-27 14:40:01.292346",
> >                         "event": "started"
> >                     }
> >                 ]
> >             }
> >         },
> > ...snip...
> >         {
> >             "description": "osd_pg_create(e179066 287.7a:177739 287.9a:177739 287.e2:177739 287.e7:177739 287.f6:177739 287.187:177739 287.1aa:177739 287.216:177739 287.306:177739 287.3e6:177739)",
> >             "initiated_at": "2019-08-27 14:35:09.908567",
> >             "age": 294.900191001,
> >             "duration": 294.90068416899999,
> >             "type_data": {
> >                 "flag_point": "delayed",
> >                 "events": [
> >                     {
> >                         "time": "2019-08-27 14:35:09.908567",
> >                         "event": "initiated"
> >                     },
> >                     {
> >                         "time": "2019-08-27 14:35:09.908567",
> >                         "event": "header_read"
> >                     },
> >                     {
> >                         "time": "2019-08-27 14:35:09.908520",
> >                         "event": "throttled"
> >                     },
> >                     {
> >                         "time": "2019-08-27 14:35:09.908617",
> >                         "event": "all_read"
> >                     },
> >                     {
> >                         "time": "2019-08-27 14:35:12.456921",
> >                         "event": "dispatched"
> >                     },
> >                     {
> >                         "time": "2019-08-27 14:35:12.456923",
> >                         "event": "wait for new map"
> >                     }
> >                 ]
> >             }
> >         }
> >     ],
> >     "num_ops": 6
> > }
> >
> >
> > That "wait for new map" message made us think something was getting hung up 
> > on the monitors, so we restarted them all without any luck.
> >
> > I'll keep investigating, but so far my Google searches aren't turning
> > anything up, so I wanted to ask whether anyone else is running into this.
> >
>
> I've seen this twice now on a ~1400 OSD cluster running Nautilus.
>
> I created a bug report for this: https://tracker.ceph.com/issues/44184
>
> Did you make any progress on this or run into it a second time?
>
> Wido
>
> > Thanks,
> > Bryan
> > _______________________________________________
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
