I've been out sick for a couple of days. I agree with Bryan Stillwell: setting those flags and doing a rolling restart of all of the OSDs is a good next step.
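A minimal sketch of that flags-plus-rolling-restart approach (assuming systemd-managed OSDs and that `ceph osd ls-tree <host>` is available on this release; the host name is taken from the output below and the sleep is illustrative):

    ceph osd set noout
    ceph osd set norebalance
    # restart the OSDs on one host, then wait for peering to settle
    # (watch `ceph -s`) before moving on to the next host
    for id in $(ceph osd ls-tree carf-ceph-osd01); do
        systemctl restart ceph-osd@"$id"
        sleep 60
    done
    ceph osd unset norebalance
    ceph osd unset noout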
On Wed, Feb 21, 2018, 3:49 PM Bryan Stillwell <bstillw...@godaddy.com> wrote:

> Bryan,
>
> The good news is that there is progress being made on making this harder
> to screw up. Read this article, for example:
>
> https://ceph.com/community/new-luminous-pg-overdose-protection/
>
> The bad news is that I don't have a great solution for you regarding your
> peering problem. I've run into things like that on testing clusters; that
> almost always teaches me not to do too many operations at one time.
> Usually some combination of flags (norecover, norebalance, nobackfill,
> noout, etc.) with OSD restarts will fix the problem. You can also query
> PGs to figure out why they aren't peering or increase logging, or if you
> want to get it back quickly you should consider Red Hat support or
> contacting a Ceph consultant like Wido:
>
> In fact, I would recommend watching Wido's presentation on "10 ways to
> break your Ceph cluster" from Ceph Days Germany earlier this month for
> other things to watch out for:
>
> https://ceph.com/cephdays/germany/
>
> Bryan
>
> From: ceph-users <ceph-users-boun...@lists.ceph.com> on behalf of Bryan Banister <bbanis...@jumptrading.com>
> Date: Tuesday, February 20, 2018 at 2:53 PM
> To: David Turner <drakonst...@gmail.com>
> Cc: Ceph Users <ceph-users@lists.ceph.com>
> Subject: Re: [ceph-users] Help rebalancing OSD usage, Luminous 12.2.2
>
> Hi David [resending with smaller message size],
>
> I tried setting the OSDs down and that does clear the blocked requests
> momentarily, but they just return to the same state. Not sure how to
> proceed here, but one thought was just to do a full cold restart of the
> entire cluster. We have disabled our backups, so the cluster is
> effectively down. Any recommendations on next steps?
>
> This also seems like a pretty serious issue, given that making this
> change has effectively broken the cluster. Perhaps Ceph should not allow
> you to increase the number of PGs so drastically, or at least make you
> put in a '--yes-i-really-mean-it' flag?
>
> Or perhaps just add some warnings on the docs.ceph.com placement groups
> page (http://docs.ceph.com/docs/master/rados/operations/placement-groups/)
> and the ceph command man page?
>
> It would be good to help others avoid this pitfall.
>
> Thanks again,
> -Bryan
>
> From: David Turner [mailto:drakonst...@gmail.com]
> Sent: Friday, February 16, 2018 3:21 PM
> To: Bryan Banister <bbanis...@jumptrading.com>
> Cc: Bryan Stillwell <bstillw...@godaddy.com>; Janne Johansson <icepic...@gmail.com>; Ceph Users <ceph-users@lists.ceph.com>
> Subject: Re: [ceph-users] Help rebalancing OSD usage, Luminous 12.2.2
>
> That sounds like a good next step. Start with the OSDs involved in the
> longest blocked requests. Wait a couple of minutes after each OSD marks
> itself back up, then continue through them. Hopefully things will start
> clearing up so that you don't need to mark all of them down. There are
> usually only a couple of OSDs holding everything up.
>
> On Fri, Feb 16, 2018 at 4:15 PM Bryan Banister <bbanis...@jumptrading.com> wrote:
>
> Thanks David,
>
> Taking the list of all OSDs that are stuck reports that a little over 50%
> of all OSDs are in this condition. There isn't any discernible pattern
> that I can find, and they are spread across the three servers. All of the
> OSDs are online as far as the service is concerned.
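One way to see what those online-but-stuck OSDs are actually blocked on is the admin socket on each OSD's host. A diagnostic sketch (osd.7 is just one of the OSDs from the stuck-request list further down the thread, and the default socket path is assumed):

    # run on the host that owns osd.7
    ceph daemon osd.7 dump_blocked_ops
    # or, pointing at the admin socket explicitly:
    ceph --admin-daemon /var/run/ceph/ceph-osd.7.asok dump_blocked_ops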
> I have also taken all PGs that were reported in the health detail output
> and looked for any that report "peering_blocked_by", but none do, so I
> can't tell if any OSD is actually blocking the peering operation.
>
> As suggested, I got a report of all peering PGs:
>
> [root@carf-ceph-osd01 ~]# ceph health detail | grep "pg " | grep peering | sort -k13
> pg 14.fe0 is stuck peering since forever, current state peering, last acting [104,94,108]
> pg 14.fe0 is stuck unclean since forever, current state peering, last acting [104,94,108]
> pg 14.fbc is stuck peering since forever, current state peering, last acting [110,91,0]
> pg 14.fd1 is stuck peering since forever, current state peering, last acting [130,62,111]
> pg 14.fd1 is stuck unclean since forever, current state peering, last acting [130,62,111]
> pg 14.fed is stuck peering since forever, current state peering, last acting [32,33,82]
> pg 14.fed is stuck unclean since forever, current state peering, last acting [32,33,82]
> pg 14.fee is stuck peering since forever, current state peering, last acting [37,96,68]
> pg 14.fee is stuck unclean since forever, current state peering, last acting [37,96,68]
> pg 14.fe8 is stuck peering since forever, current state peering, last acting [45,31,107]
> pg 14.fe8 is stuck unclean since forever, current state peering, last acting [45,31,107]
> pg 14.fc1 is stuck peering since forever, current state peering, last acting [59,124,39]
> pg 14.ff2 is stuck peering since forever, current state peering, last acting [62,117,7]
> pg 14.ff2 is stuck unclean since forever, current state peering, last acting [62,117,7]
> pg 14.fe4 is stuck peering since forever, current state peering, last acting [84,55,92]
> pg 14.fe4 is stuck unclean since forever, current state peering, last acting [84,55,92]
> pg 14.fb0 is stuck peering since forever, current state peering, last acting [94,30,38]
> pg 14.ffc is stuck peering since forever, current state peering, last acting [96,53,70]
> pg 14.ffc is stuck unclean since forever, current state peering, last acting [96,53,70]
>
> Some PGs share common OSDs, but some OSDs are listed only once.
>
> Should I try just marking the OSDs with stuck requests down to see if
> that will re-assert them?
>
> Thanks!!
> -Bryan
>
> From: David Turner [mailto:drakonst...@gmail.com]
> Sent: Friday, February 16, 2018 2:51 PM
> To: Bryan Banister <bbanis...@jumptrading.com>
> Cc: Bryan Stillwell <bstillw...@godaddy.com>; Janne Johansson <icepic...@gmail.com>; Ceph Users <ceph-users@lists.ceph.com>
> Subject: Re: [ceph-users] Help rebalancing OSD usage, Luminous 12.2.2
>
> I'll answer the questions I definitely know the answers to first, and
> then we'll continue from there. If an OSD is blocking peering but is
> online, marking it down in the cluster makes it log a message saying it
> was wrongly marked down and tell the mons it is online. That gets it to
> stop what it was doing and start talking again. I referred to that as
> re-asserting. If an OSD that you marked down doesn't mark itself back up
> within a couple of minutes, restarting the OSD might be a good idea. Then
> again, actually restarting the daemon could be bad because the daemon is
> doing something. With so many other options still left to try, actually
> restarting the daemons is something I would wait to do for now.
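For reference, a minimal sketch of that mark-down/re-assert cycle (osd.94 is just an example taken from the acting sets above, and the exact log wording can vary between releases):

    ceph osd down 94
    # on osd.94's host, confirm it noticed and re-asserted itself:
    grep -i 'wrongly marked me down' /var/log/ceph/ceph-osd.94.log | tail -1
    # verify it reported back up to the mons:
    ceph osd tree | grep -w 'osd.94'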
> The reason the cluster doesn't know anything about the PG is that it's
> still creating and hasn't actually been created yet. Starting with some
> of the OSDs that you see with blocked requests would be a good idea.
> Eventually you'll mark down an OSD that, when it comes back up, makes
> things start looking much better as PGs peer. Below is the list of OSDs
> from your previous email; any that still have stuck requests will be good
> candidates to start doing this to. On closer review, it's almost all of
> them... but you have to start somewhere. Another possible place to start
> is to look at the list of all of the peering PGs and see if there are any
> common OSDs when you look at all of them at once. Some patterns may
> emerge and would be good options to try.
>
> osds 7,39,60,103,133 have stuck requests > 67108.9 sec
> osds 5,12,13,28,33,40,55,56,61,64,69,70,75,83,92,96,110,114,119,122,123,129,131 have stuck requests > 134218 sec
> osds 4,8,10,15,16,20,27,29,30,31,34,37,38,42,43,44,47,48,49,51,52,57,66,68,73,81,84,85,87,90,95,97,99,100,102,105,106,107,108,111,112,113,121,124,127,130,132 have stuck requests > 268435 sec
>
> On Fri, Feb 16, 2018 at 2:53 PM Bryan Banister <bbanis...@jumptrading.com> wrote:
>
> Thanks David,
>
> I have set the nobackfill, norecover, noscrub, and nodeep-scrub options
> at this point and the backfills have stopped. I'll also stop the backups
> from pushing into Ceph for now.
>
> I don't want to make things worse, so I'm asking for some more guidance
> now.
>
> 1) Looking at a PG that is still peering or one that is "unknown", Ceph
> complains that it doesn't have that pgid:
>
> pg 14.fb0 is stuck peering since forever, current state peering, last acting [94,30,38]
> [root@carf-ceph-osd03 ~]# ceph pg 14.fb0 query
> Error ENOENT: i don't have pgid 14.fb0
> [root@carf-ceph-osd03 ~]#
>
> 2) One that is activating shows this for the recovery_state:
>
> [root@carf-ceph-osd03 ~]# ceph pg 14.fe1 query | less
> [snip]
>     "recovery_state": [
>         {
>             "name": "Started/Primary/Active",
>             "enter_time": "2018-02-13 14:33:21.406919",
>             "might_have_unfound": [
>                 {
>                     "osd": "84(0)",
>                     "status": "not queried"
>                 }
>             ],
>             "recovery_progress": {
>                 "backfill_targets": [
>                     "56(0)",
>                     "87(1)",
>                     "88(2)"
>                 ],
>                 "waiting_on_backfill": [],
>                 "last_backfill_started": "MIN",
>                 "backfill_info": {
>                     "begin": "MIN",
>                     "end": "MIN",
>                     "objects": []
>                 },
>                 "peer_backfill_info": [],
>                 "backfills_in_flight": [],
>                 "recovering": [],
>                 "pg_backend": {
>                     "recovery_ops": [],
>                     "read_ops": []
>                 }
>             },
>             "scrub": {
>                 "scrubber.epoch_start": "0",
>                 "scrubber.active": false,
>                 "scrubber.state": "INACTIVE",
>                 "scrubber.start": "MIN",
>                 "scrubber.end": "MIN",
>                 "scrubber.subset_last_update": "0'0",
>                 "scrubber.deep": false,
>                 "scrubber.seed": 0,
>                 "scrubber.waiting_on": 0,
>                 "scrubber.waiting_on_whom": []
>             }
>         },
>         {
>             "name": "Started",
>             "enter_time": "2018-02-13 14:33:17.491148"
>         }
>     ],
>
> Sorry for all the hand holding, but how do I determine whether I need to
> set an OSD as 'down' to fix the issues, and how does it go about
> re-asserting itself?
>
> I again tried looking at the Ceph docs on troubleshooting OSDs but didn't
> find any details. The man page also has no details.
>
> Thanks again,
> -Bryan
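On David's suggestion of looking for common OSDs across all of the peering PGs at once, a rough counting pipeline (a sketch only; the field extraction assumes acting sets are printed as bracketed lists, as in the output above):

    # count how often each OSD appears in the acting sets of stuck-peering PGs
    ceph health detail | grep 'is stuck peering' \
      | grep -o '\[[0-9,]*\]' | tr -d '[]' | tr ',' '\n' \
      | sort -n | uniq -c | sort -rn | head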
> From: David Turner [mailto:drakonst...@gmail.com]
> Sent: Friday, February 16, 2018 1:21 PM
> To: Bryan Banister <bbanis...@jumptrading.com>
> Cc: Bryan Stillwell <bstillw...@godaddy.com>; Janne Johansson <icepic...@gmail.com>; Ceph Users <ceph-users@lists.ceph.com>
> Subject: Re: [ceph-users] Help rebalancing OSD usage, Luminous 12.2.2
>
> Your problem might have been creating too many PGs at once. I generally
> increase pg_num and pgp_num by no more than 256 at a time, making sure
> that all PGs are created, peered, and healthy (other than backfilling)
> before increasing them again.
>
> To help you get back to a healthy state, let's start off by getting all
> of your PGs peered. Go ahead and put a stop to backfilling, recovery,
> scrubbing, etc. Those are all hindering the peering effort right now.
> The more clients you can disable, the better.
>
> ceph osd set nobackfill
> ceph osd set norecover
> ceph osd set noscrub
> ceph osd set nodeep-scrub
>
> After that, look at your peering PGs and find out what is blocking their
> peering. This is where you might need to use `ceph osd down 23` (assuming
> you needed to kick osd.23) to mark an OSD down in the cluster and let it
> re-assert itself. Once all PGs are done peering, go ahead and unset
> nobackfill and norecover and let the cluster start moving data around.
> Leaving noscrub and nodeep-scrub set is optional and up to you. I'll
> never say it's better to leave scrubbing disabled, but scrubbing does use
> a fair bit of the spindles' time while you're trying to backfill.
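Once the peering PGs have drained (e.g. `ceph pg dump_stuck inactive` comes back empty), resuming data movement would look roughly like this sketch:

    ceph osd unset nobackfill
    ceph osd unset norecover
    # optionally, once backfill has settled down:
    ceph osd unset noscrub
    ceph osd unset nodeep-scrub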
> On Fri, Feb 16, 2018 at 2:12 PM Bryan Banister <bbanis...@jumptrading.com> wrote:
>
> Well, I decided to try the increase in PGs to 4096 and that seems to have
> caused some issues:
>
> 2018-02-16 12:38:35.798911 mon.carf-ceph-osd01 [ERR] overall HEALTH_ERR 61802168/241154376 objects misplaced (25.628%); Reduced data availability: 2081 pgs inactive, 322 pgs peering; Degraded data redundancy: 552/241154376 objects degraded (0.000%), 3099 pgs unclean, 38 pgs degraded; 163 stuck requests are blocked > 4096 sec
>
> The cluster is actively backfilling misplaced objects, but not all PGs
> are active at this point and many are stuck peering, stuck unclean, or
> have a state of unknown:
>
> PG_AVAILABILITY Reduced data availability: 2081 pgs inactive, 322 pgs peering
>     pg 14.fae is stuck inactive for 253360.025730, current state activating+remapped, last acting [85,12,41]
>     pg 14.faf is stuck inactive for 253368.511573, current state unknown, last acting []
>     pg 14.fb0 is stuck peering since forever, current state peering, last acting [94,30,38]
>     pg 14.fb1 is stuck inactive for 253362.605886, current state activating+remapped, last acting [6,74,34]
> [snip]
>
> The health also shows a large number of degraded data redundancy PGs:
>
> PG_DEGRADED Degraded data redundancy: 552/241154376 objects degraded (0.000%), 3099 pgs unclean, 38 pgs degraded
>     pg 14.fc7 is stuck unclean for 253368.511573, current state unknown, last acting []
>     pg 14.fc8 is stuck unclean for 531622.531271, current state active+remapped+backfill_wait, last acting [73,132,71]
>     pg 14.fca is stuck unclean for 420540.396199, current state active+remapped+backfill_wait, last acting [0,80,61]
>     pg 14.fcb is stuck unclean for 531622.421855, current state activating+remapped, last acting [70,26,75]
> [snip]
>
> We also now have a number of stuck requests:
>
> REQUEST_STUCK 163 stuck requests are blocked > 4096 sec
>     69 ops are blocked > 268435 sec
>     66 ops are blocked > 134218 sec
>     28 ops are blocked > 67108.9 sec
>     osds 7,39,60,103,133 have stuck requests > 67108.9 sec
>     osds 5,12,13,28,33,40,55,56,61,64,69,70,75,83,92,96,110,114,119,122,123,129,131 have stuck requests > 134218 sec
>     osds 4,8,10,15,16,20,27,29,30,31,34,37,38,42,43,44,47,48,49,51,52,57,66,68,73,81,84,85,87,90,95,97,99,100,102,105,106,107,108,111,112,113,121,124,127,130,132 have stuck requests > 268435 sec
>
> I tried looking through the mailing list archive for how to resolve the
> stuck requests, and it seems that restarting the OSDs is the right way?
>
> At this point we have just been watching the backfills run and see a
> steady but slow decrease in misplaced objects. When the cluster is idle,
> the overall OSD disk utilization is not too bad, at roughly 40% on the
> physical disks running these backfills.
>
> However, we still have our backups trying to push new images into the
> cluster. This worked OK for the first few days, but yesterday we started
> getting failure alerts. I checked the status of the RGW service and
> noticed that 2 of the 3 RGW civetweb servers were not responsive. I
> restarted the RGWs on the ones that appeared hung and that got them
> working for a while, but then the same condition recurred. The RGWs seem
> to have recovered on their own now, but again the cluster is idle and
> only the backfills are currently doing anything (that I can tell). I did
> see these log entries:
>
> 2018-02-15 16:46:07.541542 7fffe6c56700  1 heartbeat_map is_healthy 'RGWAsyncRadosProcessor::m_tp thread 0x7fffcec26700' had timed out after 600
> 2018-02-15 16:46:12.541613 7fffe6c56700  1 heartbeat_map is_healthy 'RGWAsyncRadosProcessor::m_tp thread 0x7fffdbc40700' had timed out after 600
> 2018-02-15 16:46:12.541629 7fffe6c56700  1 heartbeat_map is_healthy 'RGWAsyncRadosProcessor::m_tp thread 0x7fffcec26700' had timed out after 600
> 2018-02-15 16:46:17.541701 7fffe6c56700  1 heartbeat_map is_healthy 'RGWAsyncRadosProcessor::m_tp thread 0x7fffdbc40700' had timed out after 600
>
> At this point we do not know how to proceed with recovery efforts. I
> tried looking at the Ceph docs and mailing list archives but wasn't able
> to determine the right path forward here.
>
> Any help is appreciated,
> -Bryan
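On the RGW hangs: a minimal sketch of restarting a hung instance and watching for the heartbeat_map timeouts to recur (this assumes the default systemd unit naming of ceph-radosgw@rgw.<hostname>; adjust to match the actual instance name):

    systemctl restart ceph-radosgw@rgw.$(hostname -s)
    journalctl -u ceph-radosgw@rgw.$(hostname -s) -f | grep heartbeat_map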