So your OSDs are really too busy to respond to heartbeats.
You'll be facing this for some time, until the cluster load gets lower.

I would set `ceph osd set nodeep-scrub` until the heavy disk IO stops.
Maybe you can schedule deep scrubbing to be enabled during the night and
disabled again in the morning, for example with a cron job like the one
below.
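
A minimal sketch of such a schedule in root's crontab (the hours are only
an example, and I'm assuming the node has an admin keyring and `ceph` in
cron's PATH):

# allow deep scrubs again at night (22:00)
0 22 * * * ceph osd unset nodeep-scrub
# block them in the morning (07:00) before the daily load starts
0 7 * * * ceph osd set nodeep-scrub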

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*


On Wed, Aug 8, 2018 at 9:18 AM CUZA Frédéric <frederic.c...@sib.fr> wrote:

> Thanks for the command lines, I did take a look but I don't really know
> what to search for, my bad…
>
> All this flapping is due to deep-scrub: when it starts on an OSD, things
> start to go bad.
>
>
>
> I marked out all the OSDs that were flapping the most (one by one, after
> rebalancing) and it looks better, even if some OSDs keep going down/up
> with the same message in the logs:
>
>
>
> 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fdabd897700' had
> timed out after 90
>
>
>
> (I updated the timeout to 90 s instead of 15 s.)
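>
> In case it helps anyone, here is a rough sketch of how such a change can
> be applied, assuming it is the osd_op_thread_timeout option being raised
> (90 is simply the value I chose):
>
> # inject into the running OSDs (takes effect immediately)
> ceph tell osd.* injectargs '--osd_op_thread_timeout 90'
> # to keep it across restarts, also add under [osd] in ceph.conf:
> #   osd op thread timeout = 90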
>
>
>
> Regards,
>
>
>
>
>
>
>
> *From:* ceph-users <ceph-users-boun...@lists.ceph.com> *On behalf of*
> Webert de Souza Lima
> *Sent:* 07 August 2018 16:28
> *To:* ceph-users <ceph-users@lists.ceph.com>
> *Subject:* Re: [ceph-users] Whole cluster flapping
>
>
>
> oops, my bad, you're right.
>
>
>
> I don't know how much you can see, but maybe you can dig around the
> performance counters and see what's happening on those OSDs; try these:
>
>
>
> ~# ceph daemonperf osd.XX
>
> ~# ceph daemon osd.XX perf dump
>
>
>
> change XX to your OSD numbers.
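>
> If the full dump is too noisy, something like this can narrow it down to
> a single counter (assuming jq is installed; op_latency is just one
> example of a counter worth watching):
>
> ceph daemon osd.XX perf dump | jq '.osd.op_latency'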
>
>
>
> Regards,
>
>
>
> Webert Lima
>
> DevOps Engineer at MAV Tecnologia
>
> *Belo Horizonte - Brasil*
>
> *IRC NICK - WebertRLZ*
>
>
>
>
>
> On Tue, Aug 7, 2018 at 10:47 AM CUZA Frédéric <frederic.c...@sib.fr>
> wrote:
>
> Pool is already deleted and no longer present in stats.
>
>
>
> Regards,
>
>
>
> *From:* ceph-users <ceph-users-boun...@lists.ceph.com> *On behalf of*
> Webert de Souza Lima
> *Sent:* 07 August 2018 15:08
> *To:* ceph-users <ceph-users@lists.ceph.com>
> *Subject:* Re: [ceph-users] Whole cluster flapping
>
>
>
> Frédéric,
>
>
>
> see if the number of objects is decreasing in the pool with `ceph df
> [detail]`
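>
> For instance, to poll it periodically (the 60-second interval is only a
> suggestion):
>
> watch -n 60 'ceph df detail'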
>
>
>
> Regards,
>
>
>
> Webert Lima
>
> DevOps Engineer at MAV Tecnologia
>
> *Belo Horizonte - Brasil*
>
> *IRC NICK - WebertRLZ*
>
>
>
>
>
> On Tue, Aug 7, 2018 at 5:46 AM CUZA Frédéric <frederic.c...@sib.fr> wrote:
>
> It's been over a week now and the whole cluster keeps flapping; it is
> never the same OSDs that go down.
>
> Is there a way to get the progress of this recovery? (The pool that I
> deleted is no longer present, and hasn't been for a while now.)
>
> In fact, there is a lot of I/O activity on the server where the OSDs go
> down.
>
>
>
> Regards,
>
>
>
> *From:* ceph-users <ceph-users-boun...@lists.ceph.com> *On behalf of*
> Webert de Souza Lima
> *Sent:* 31 July 2018 16:25
> *To:* ceph-users <ceph-users@lists.ceph.com>
> *Subject:* Re: [ceph-users] Whole cluster flapping
>
>
>
> The pool deletion might have triggered a lot of IO operations on the disks,
> and the OSD processes might be too busy to respond to heartbeats, so the
> mons mark them as down due to no response.
>
> Also check the OSD logs to see if the daemons are actually crashing and
> restarting, and check disk IO usage (e.g. with iostat).
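>
> For example (assuming the sysstat package is installed), extended per-disk
> statistics every 5 seconds:
>
> iostat -x 5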
>
>
>
> Regards,
>
>
>
> Webert Lima
>
> DevOps Engineer at MAV Tecnologia
>
> *Belo Horizonte - Brasil*
>
> *IRC NICK - WebertRLZ*
>
>
>
>
>
> On Tue, Jul 31, 2018 at 7:23 AM CUZA Frédéric <frederic.c...@sib.fr>
> wrote:
>
> Hi Everyone,
>
>
>
> I just upgraded our cluster to Luminous 12.2.7 and deleted a quite large
> pool that we had (120 TB).
>
> Our cluster is made of 14 nodes, each composed of 12 OSDs (1 HDD -> 1
> OSD); we have SSDs for the journals.
>
>
>
> After I deleted the large pool, my cluster started flapping on all OSDs.
>
> OSDs are marked down and then marked up as follows:
>
>
>
> 2018-07-31 10:42:51.504319 mon.ceph_monitor01 [INF] osd.97
> 172.29.228.72:6800/95783 boot
>
> 2018-07-31 10:42:55.330993 mon.ceph_monitor01 [WRN] Health check update:
> 5798/5845200 objects misplaced (0.099%) (OBJECT_MISPLACED)
>
> 2018-07-31 10:42:55.331065 mon.ceph_monitor01 [WRN] Health check update:
> Degraded data redundancy: 221365/5845200 objects degraded (3.787%), 98 pgs
> degraded, 317 pgs undersized (PG_DEGRADED)
>
> 2018-07-31 10:42:55.331093 mon.ceph_monitor01 [WRN] Health check update:
> 81 slow requests are blocked > 32 sec (REQUEST_SLOW)
>
> 2018-07-31 10:42:55.548385 mon.ceph_monitor01 [WRN] Health check update:
> Reduced data availability: 13 pgs inactive, 4 pgs peering (PG_AVAILABILITY)
>
> 2018-07-31 10:42:55.610556 mon.ceph_monitor01 [INF] osd.96
> 172.29.228.72:6803/95830 boot
>
> 2018-07-31 10:43:00.331787 mon.ceph_monitor01 [WRN] Health check update: 5
> osds down (OSD_DOWN)
>
> 2018-07-31 10:43:00.331930 mon.ceph_monitor01 [WRN] Health check update:
> 5782/5845401 objects misplaced (0.099%) (OBJECT_MISPLACED)
>
> 2018-07-31 10:43:00.331950 mon.ceph_monitor01 [WRN] Health check update:
> Degraded data redundancy: 167757/5845401 objects degraded (2.870%), 77 pgs
> degraded, 223 pgs undersized (PG_DEGRADED)
>
> 2018-07-31 10:43:00.331966 mon.ceph_monitor01 [WRN] Health check update:
> 76 slow requests are blocked > 32 sec (REQUEST_SLOW)
>
> 2018-07-31 10:43:01.729891 mon.ceph_monitor01 [WRN] Health check update:
> Reduced data availability: 7 pgs inactive, 6 pgs peering (PG_AVAILABILITY)
>
> 2018-07-31 10:43:01.753867 mon.ceph_monitor01 [INF] osd.4
> 172.29.228.246:6812/3144542 boot
>
> 2018-07-31 10:43:05.332624 mon.ceph_monitor01 [WRN] Health check update: 4
> osds down (OSD_DOWN)
>
> 2018-07-31 10:43:05.332691 mon.ceph_monitor01 [WRN] Health check update:
> 5767/5845569 objects misplaced (0.099%) (OBJECT_MISPLACED)
>
> 2018-07-31 10:43:05.332718 mon.ceph_monitor01 [WRN] Health check update:
> Degraded data redundancy: 130565/5845569 objects degraded (2.234%), 67 pgs
> degraded, 220 pgs undersized (PG_DEGRADED)
>
> 2018-07-31 10:43:05.332736 mon.ceph_monitor01 [WRN] Health check update:
> 83 slow requests are blocked > 32 sec (REQUEST_SLOW)
>
> 2018-07-31 10:43:07.004993 mon.ceph_monitor01 [WRN] Health check update:
> Reduced data availability: 5 pgs inactive, 5 pgs peering (PG_AVAILABILITY)
>
> 2018-07-31 10:43:10.333548 mon.ceph_monitor01 [WRN] Health check update:
> 5752/5845758 objects misplaced (0.098%) (OBJECT_MISPLACED)
>
> 2018-07-31 10:43:10.333593 mon.ceph_monitor01 [WRN] Health check update:
> Degraded data redundancy: 107805/5845758 objects degraded (1.844%), 59 pgs
> degraded, 197 pgs undersized (PG_DEGRADED)
>
> 2018-07-31 10:43:10.333608 mon.ceph_monitor01 [WRN] Health check update:
> 95 slow requests are blocked > 32 sec (REQUEST_SLOW)
>
> 2018-07-31 10:43:15.334451 mon.ceph_monitor01 [WRN] Health check update:
> 5738/5845923 objects misplaced (0.098%) (OBJECT_MISPLACED)
>
> 2018-07-31 10:43:15.334494 mon.ceph_monitor01 [WRN] Health check update:
> Degraded data redundancy: 107807/5845923 objects degraded (1.844%), 59 pgs
> degraded, 197 pgs undersized (PG_DEGRADED)
>
> 2018-07-31 10:43:15.334510 mon.ceph_monitor01 [WRN] Health check update:
> 98 slow requests are blocked > 32 sec (REQUEST_SLOW)
>
> 2018-07-31 10:43:15.334865 mon.ceph_monitor01 [INF] osd.18 failed
> (root=default,room=xxxx,host=xxxx) (8 reporters from different host after
> 54.650576 >= grace 54.300663)
>
> 2018-07-31 10:43:15.336552 mon.ceph_monitor01 [WRN] Health check update: 5
> osds down (OSD_DOWN)
>
> 2018-07-31 10:43:17.357747 mon.ceph_monitor01 [WRN] Health check update:
> Reduced data availability: 6 pgs inactive, 6 pgs peering (PG_AVAILABILITY)
>
> 2018-07-31 10:43:20.339495 mon.ceph_monitor01 [WRN] Health check update:
> 5724/5846073 objects misplaced (0.098%) (OBJECT_MISPLACED)
>
> 2018-07-31 10:43:20.339543 mon.ceph_monitor01 [WRN] Health check update:
> Degraded data redundancy: 122901/5846073 objects degraded (2.102%), 65 pgs
> degraded, 201 pgs undersized (PG_DEGRADED)
>
> 2018-07-31 10:43:20.339559 mon.ceph_monitor01 [WRN] Health check update:
> 78 slow requests are blocked > 32 sec (REQUEST_SLOW)
>
> 2018-07-31 10:43:22.481251 mon.ceph_monitor01 [WRN] Health check update: 4
> osds down (OSD_DOWN)
>
> 2018-07-31 10:43:22.498621 mon.ceph_monitor01 [INF] osd.18
> 172.29.228.5:6812/14996 boot
>
> 2018-07-31 10:43:25.340099 mon.ceph_monitor01 [WRN] Health check update:
> 5712/5846235 objects misplaced (0.098%) (OBJECT_MISPLACED)
>
> 2018-07-31 10:43:25.340147 mon.ceph_monitor01 [WRN] Health check update:
> Reduced data availability: 6 pgs inactive, 3 pgs peering (PG_AVAILABILITY)
>
> 2018-07-31 10:43:25.340163 mon.ceph_monitor01 [WRN] Health check update:
> Degraded data redundancy: 138553/5846235 objects degraded (2.370%), 74 pgs
> degraded, 201 pgs undersized (PG_DEGRADED)
>
> 2018-07-31 10:43:25.340181 mon.ceph_monitor01 [WRN] Health check update:
> 11 slow requests are blocked > 32 sec (REQUEST_SLOW)
>
>
>
> On the OSDs that failed, the logs are full of this kind of message:
>
> 2018-07-31 03:41:28.789681 7f698b66c700  1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7f6976685700' had timed out after 15
>
> 2018-07-31 03:41:28.945710 7f698ae6b700  1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7f6976685700' had timed out after 15
>
> 2018-07-31 03:41:28.946263 7f698be6d700  1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7f6976685700' had timed out after 15
>
> 2018-07-31 03:41:28.994397 7f698b66c700  1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7f6976685700' had timed out after 15
>
> 2018-07-31 03:41:28.994443 7f698ae6b700  1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7f6976685700' had timed out after 15
>
> 2018-07-31 03:41:29.023356 7f698be6d700  1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7f6976685700' had timed out after 15
>
> 2018-07-31 03:41:29.023415 7f698be6d700  1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7f6976685700' had timed out after 15
>
> 2018-07-31 03:41:29.102909 7f698ae6b700  1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7f6976685700' had timed out after 15
>
> 2018-07-31 03:41:29.102917 7f698b66c700  1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7f6976685700' had timed out after 15
>
>
>
> At first it seemed like a network issue, but we haven't changed a thing on
> the network and this cluster has been okay for months.
>
>
>
> I can't figure out what is happening at the moment; any help would be
> greatly appreciated!
>
>
>
> Regards,
>
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
