[ceph-users] Re: Slow peering caused by "wait for new map"

Bryan Stillwell Wed, 04 Sep 2019 14:13:25 -0700

Thanks Arthur,

I don't think it's being caused by fragile OSDs.  I'm now suspecting that it 
could be related to msgr v2.  I forgot to mention that this cluster's mon nodes 
are OpenStack VMs.  I'm wondering if there's some network configuration I need 
to change to get v2 working correctly (I've opened up TCP port 3300, but maybe 
something else needs to be done?)  Right now I'm trying to figure out how to 
disable msgr v2 to test this theory, but setting ms_bind_msgr2 to false doesn't 
seem to be working.  The mons are still listening on port 3300, and the OSDs 
are still connecting to the mons on that same port.


Does anyone know how to disable msgr v2?  I've also tried setting mon_host to 
only list the mon IPs with v1 and port 6789, but that didn't work either.

Bryan

On Sep 4, 2019, at 12:54 PM, Guilherme Geronimo 
<[email protected]<mailto:[email protected]>> wrote:

Notice: This email is from an external sender.


Btw: After the storm, I highly suggest you to consider to use Jumboframe.
It works like a charm.

[]'s
Arthur (aKa Guilherme Geronimo)

On 04/09/2019 15:50, Guilherme Geronimo wrote:

I see that you have many inactive PGs, probably because the 6 OSD OUT+DOWN.

Problems with "flapping" OSD I use to solved:

* setting NOUP flag
* restarting the "fragile" OSDs
* check if everything is ok look ing their logs
* taking off the NOUP flag

Another solution is:
* Setting NOIN and NOUP flag
* Taking the fragile OSD out
* restarting the "fragile" OSDs
* check if everything is ok look ing their logs
* taking off the NOUP flag
* Take a coffee and wait till all data are drain

[]'s
Arthur (aKa Guilherme Geronimo)

On 04/09/2019 15:32, Bryan Stillwell wrote:
We are not using jumbo frames anywhere on this cluster (all mtu 1500).  The 
cluster was originally built in October of 2016 and has the following history:

2016-10-04: Created with Hammer (0.94.3)
2017-05-03: Upgraded to Hammer (0.94.10)
2017-10-09: Upgraded to Jewel (10.2.9)
2017-11-02: Upgraded to Jewel (10.2.10)
2018-04-30: Upgraded to Luminous (12.2.5)
2018-09-05: Upgraded to Luminous (12.2.8)
2019-04-05: Upgraded to Luminous (12.2.11)
2019-04-18: Upgraded to Luminous (12.2.12)
2019-07-26: Upgraded to Nautilus (14.2.2)

It wasn't until after the Nautilus upgrade when this problem started showing up.

Here's the output you requested:

[root@a2mon002 ~]# ceph -s
  cluster:
    id:     XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
    health: HEALTH_ERR
            nodown,norebalance,noscrub,nodeep-scrub flag(s) set
            1 nearfull osd(s)
            19 pool(s) nearfull
            1 scrub errors
            Reduced data availability: 6014 pgs inactive, 3 pgs down, 5958 pgs 
peering, 83 pgs stale
            Possible data damage: 1 pg inconsistent
            Degraded data redundancy: 1601/81648846 objects degraded (0.002%), 
4 pgs degraded, 5 pgs undersized
            1048 slow requests are blocked > 32 sec

  services:
    mon: 3 daemons, quorum a2mon002,a2mon003,a2mon004 (age 17m)
    mgr: a2mon004(active, since 53m), standbys: a2mon003, a2mon002
    mds: cephfs:2 {0=a2mon004=up:active(laggy or 
crashed),1=a2mon003=up:active(laggy or crashed)} 1 up:standby
    osd: 143 osds: 141 up, 137 in; 486 remapped pgs
         flags nodown,norebalance,noscrub,nodeep-scrub

  data:
    pools:   20 pools, 6288 pgs
    objects: 27.22M objects, 98 TiB
    usage:   308 TiB used, 114 TiB / 422 TiB avail
    pgs:     0.048% pgs unknown
             95.611% pgs not active
             1601/81648846 objects degraded (0.002%)
             53012/81648846 objects misplaced (0.065%)
             5379 peering
             495  remapped+peering
             269  active+clean
             75   stale+peering
             46   activating
             7    stale+remapped+peering
             3    unknown
             3    active+undersized+degraded
             3    down
             2    activating+remapped
             1    activating+undersized
             1    active+clean+scrubbing
             1    remapped+inconsistent+peering
             1    activating+undersized+degraded
             1    stale+activating
             1    creating+peering

[root@a2mon002 ~]# ceph versions
{
    "mon": {
        "ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) 
nautilus (stable)": 3
    },
    "mgr": {
        "ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) 
nautilus (stable)": 3
    },
    "osd": {
        "ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) 
nautilus (stable)": 141
    },
    "mds": {
        "ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) 
nautilus (stable)": 1
    },
    "overall": {
        "ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) 
nautilus (stable)": 148
    }
}


We had seen the slow peering shortly after the Nautilus upgrade, but it 
eventually recovered.  We then started filling the cluster up to test another 
Nautilus bug (https://tracker.ceph.com/issues/41255), but then a disk started 
to die (which caused the inconsistent PG).  When we marked it out we ran into 
this peering problem again, but it seems much worse this time.

Thanks,
Bryan

On Sep 4, 2019, at 11:55 AM, Guilherme Geronimo 
<[email protected]<mailto:[email protected]>> wrote:

Notice: This email is from an external sender.



Hey Bryan,

I suppose all nodes are using jumboframes (mtu 9000), right?
I would suggest to check OSD->MON communication.

Can you send the output os these commands for us?
* ceph -s
* ceph versions

[]'s
Arthur (aKa Guilherme Geronimo)

On 04/09/2019 14:18, Bryan Stillwell wrote:
Our test cluster is seeing a problem where peering is going incredibly slow 
shortly after upgrading it to Nautilus (14.2.2) from Luminous (12.2.12).

>From what I can tell it seems to be caused by "wait for new map" taking a long 
>time.  When looking at dump_historic_slow_ops on pretty much any OSD I see 
>stuff like this:

# ceph daemon osd.112 dump_historic_slow_ops
[...snip...]
       {
           "description": "osd_pg_create(e180614 287.4b:177739 287.75:177739 
287.1c3:177739 287.1cf:177739 287.1e1:177739 287.2dd:177739 287.2fc:177739 
287.342:177739 287.382:177739)",
           "initiated_at": "2019-09-03 15:12:41.366514",
           "age": 4800.8847047119998,
           "duration": 4780.0579745630002,
           "type_data": {
               "flag_point": "started",
               "events": [
                   {
                       "time": "2019-09-03 15:12:41.366514",
                       "event": "initiated"
                   },
                   {
                       "time": "2019-09-03 15:12:41.366514",
                       "event": "header_read"
                   },
                   {
                       "time": "2019-09-03 15:12:41.366501",
                       "event": "throttled"
                   },
                   {
                       "time": "2019-09-03 15:12:41.366547",
                       "event": "all_read"
                   },
                   {
                       "time": "2019-09-03 15:39:03.379456",
                       "event": "dispatched"
                   },
                   {
                       "time": "2019-09-03 15:39:03.379477",
                       "event": "wait for new map"
                   },
                   {
                       "time": "2019-09-03 15:39:03.522376",
                       "event": "wait for new map"
                   },
                   {
                       "time": "2019-09-03 15:53:55.912499",
                       "event": "wait for new map"
                   },
                   {
                       "time": "2019-09-03 15:59:37.909063",
                       "event": "wait for new map"
                   },
                   {
                       "time": "2019-09-03 16:00:43.356023",
                       "event": "wait for new map"
                   },
                   {
                       "time": "2019-09-03 16:20:50.575498",
                       "event": "wait for new map"
                   },
                   {
                       "time": "2019-09-03 16:31:48.689415",
                       "event": "started"
                   },
                   {
                       "time": "2019-09-03 16:32:21.424489",
                       "event": "done"
                   }
               ]
           }

It always seems to be in osd_pg_create() with multiple "wait for new map" 
messages before it finally does something.  What could be causing it so long to 
get the OSD map?  The mons don't appear to be overloaded in any way.

Thanks,
Bryan
_______________________________________________
ceph-users mailing list -- [email protected]<mailto:[email protected]>
To unsubscribe send an email to 
[email protected]<mailto:[email protected]>
_______________________________________________
ceph-users mailing list -- [email protected]<mailto:[email protected]>
To unsubscribe send an email to 
[email protected]<mailto:[email protected]>

_______________________________________________
ceph-users mailing list -- [email protected]<mailto:[email protected]>
To unsubscribe send an email to 
[email protected]<mailto:[email protected]>

_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]

[ceph-users] Re: Slow peering caused by "wait for new map"

Reply via email to