I've been out sick for a couple of days. I agree with Bryan Stillwell: setting those flags and doing a rolling restart of all of the OSDs is a good next step.
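A minimal sketch of that flags-plus-rolling-restart approach (assuming systemd-managed OSDs and that `ceph osd ls-tree <host>` is available on this release; the host name is taken from the output below and the sleep is illustrative):

    ceph osd set noout
    ceph osd set norebalance
    # restart the OSDs on one host, then wait for peering to settle
    # (watch `ceph -s`) before moving on to the next host
    for id in $(ceph osd ls-tree carf-ceph-osd01); do
        systemctl restart ceph-osd@"$id"
        sleep 60
    done
    ceph osd unset norebalance
    ceph osd unset noout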
On Wed, Feb 21, 2018, 3:49 PM Bryan Stillwell <bstillw...@godaddy.com> wrote:

> Bryan,
>
> The good news is that there is progress being made on making this harder
> to screw up. Read this article, for example:
>
> https://ceph.com/community/new-luminous-pg-overdose-protection/
>
> The bad news is that I don't have a great solution for you regarding your
> peering problem. I've run into things like that on testing clusters; that
> almost always teaches me not to do too many operations at one time.
> Usually some combination of flags (norecover, norebalance, nobackfill,
> noout, etc.) with OSD restarts will fix the problem. You can also query
> PGs to figure out why they aren't peering or increase logging, or if you
> want to get it back quickly you should consider Red Hat support or
> contacting a Ceph consultant like Wido:
>
> In fact, I would recommend watching Wido's presentation on "10 ways to
> break your Ceph cluster" from Ceph Days Germany earlier this month for
> other things to watch out for:
>
> https://ceph.com/cephdays/germany/
>
> Bryan
>
> From: ceph-users <ceph-users-boun...@lists.ceph.com> on behalf of Bryan Banister <bbanis...@jumptrading.com>
> Date: Tuesday, February 20, 2018 at 2:53 PM
> To: David Turner <drakonst...@gmail.com>
> Cc: Ceph Users <ceph-users@lists.ceph.com>
> Subject: Re: [ceph-users] Help rebalancing OSD usage, Luminous 12.2.2
>
> Hi David [resending with smaller message size],
>
> I tried setting the OSDs down and that does clear the blocked requests
> momentarily, but they just return to the same state. Not sure how to
> proceed here, but one thought was just to do a full cold restart of the
> entire cluster. We have disabled our backups, so the cluster is
> effectively down. Any recommendations on next steps?
>
> This also seems like a pretty serious issue, given that making this
> change has effectively broken the cluster. Perhaps Ceph should not allow
> you to increase the number of PGs so drastically, or at least make you
> put in a '--yes-i-really-mean-it' flag?
>
> Or perhaps just add some warnings on the docs.ceph.com placement groups
> page (http://docs.ceph.com/docs/master/rados/operations/placement-groups/)
> and the ceph command man page?
>
> It would be good to help others avoid this pitfall.
>
> Thanks again,
> -Bryan
>
> From: David Turner [mailto:drakonst...@gmail.com]
> Sent: Friday, February 16, 2018 3:21 PM
> To: Bryan Banister <bbanis...@jumptrading.com>
> Cc: Bryan Stillwell <bstillw...@godaddy.com>; Janne Johansson <icepic...@gmail.com>; Ceph Users <ceph-users@lists.ceph.com>
> Subject: Re: [ceph-users] Help rebalancing OSD usage, Luminous 12.2.2
>
> That sounds like a good next step. Start with the OSDs involved in the
> longest blocked requests. Wait a couple of minutes after each OSD marks
> itself back up, then continue through them. Hopefully things will start
> clearing up so that you don't need to mark all of them down. There are
> usually only a couple of OSDs holding everything up.
>
> On Fri, Feb 16, 2018 at 4:15 PM Bryan Banister <bbanis...@jumptrading.com> wrote:
>
> Thanks David,
>
> Taking the list of all OSDs that are stuck reports that a little over 50%
> of all OSDs are in this condition. There isn't any discernible pattern
> that I can find, and they are spread across the three servers. All of the
> OSDs are online as far as the service is concerned.
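One way to see what those online-but-stuck OSDs are actually blocked on is the admin socket on each OSD's host. A diagnostic sketch (osd.7 is just one of the OSDs from the stuck-request list further down the thread, and the default socket path is assumed):

    # run on the host that owns osd.7
    ceph daemon osd.7 dump_blocked_ops
    # or, pointing at the admin socket explicitly:
    ceph --admin-daemon /var/run/ceph/ceph-osd.7.asok dump_blocked_ops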
> I have also taken all PGs that were reported in the health detail output
> and looked for any that report "peering_blocked_by", but none do, so I
> can't tell if any OSD is actually blocking the peering operation.
>
> As suggested, I got a report of all peering PGs:
>
> [root@carf-ceph-osd01 ~]# ceph health detail | grep "pg " | grep peering | sort -k13
> pg 14.fe0 is stuck peering since forever, current state peering, last acting [104,94,108]
> pg 14.fe0 is stuck unclean since forever, current state peering, last acting [104,94,108]
> pg 14.fbc is stuck peering since forever, current state peering, last acting [110,91,0]
> pg 14.fd1 is stuck peering since forever, current state peering, last acting [130,62,111]
> pg 14.fd1 is stuck unclean since forever, current state peering, last acting [130,62,111]
> pg 14.fed is stuck peering since forever, current state peering, last acting [32,33,82]
> pg 14.fed is stuck unclean since forever, current state peering, last acting [32,33,82]
> pg 14.fee is stuck peering since forever, current state peering, last acting [37,96,68]
> pg 14.fee is stuck unclean since forever, current state peering, last acting [37,96,68]
> pg 14.fe8 is stuck peering since forever, current state peering, last acting [45,31,107]
> pg 14.fe8 is stuck unclean since forever, current state peering, last acting [45,31,107]
> pg 14.fc1 is stuck peering since forever, current state peering, last acting [59,124,39]
> pg 14.ff2 is stuck peering since forever, current state peering, last acting [62,117,7]
> pg 14.ff2 is stuck unclean since forever, current state peering, last acting [62,117,7]
> pg 14.fe4 is stuck peering since forever, current state peering, last acting [84,55,92]
> pg 14.fe4 is stuck unclean since forever, current state peering, last acting [84,55,92]
> pg 14.fb0 is stuck peering since forever, current state peering, last acting [94,30,38]
> pg 14.ffc is stuck peering since forever, current state peering, last acting [96,53,70]
> pg 14.ffc is stuck unclean since forever, current state peering, last acting [96,53,70]
>
> Some PGs share common OSDs, but some OSDs are listed only once.
>
> Should I try just marking the OSDs with stuck requests down to see if
> that will re-assert them?
>
> Thanks!!
> -Bryan
>
> From: David Turner [mailto:drakonst...@gmail.com]
> Sent: Friday, February 16, 2018 2:51 PM
> To: Bryan Banister <bbanis...@jumptrading.com>
> Cc: Bryan Stillwell <bstillw...@godaddy.com>; Janne Johansson <icepic...@gmail.com>; Ceph Users <ceph-users@lists.ceph.com>
> Subject: Re: [ceph-users] Help rebalancing OSD usage, Luminous 12.2.2
>
> I'll answer the questions I definitely know the answers to first, and
> then we'll continue from there. If an OSD is blocking peering but is
> online, marking it down in the cluster makes it log a message saying it
> was wrongly marked down and tell the mons it is online. That gets it to
> stop what it was doing and start talking again. I referred to that as
> re-asserting. If an OSD that you marked down doesn't mark itself back up
> within a couple of minutes, restarting the OSD might be a good idea. Then
> again, actually restarting the daemon could be bad because the daemon is
> doing something. With so many other options still left to try, actually
> restarting the daemons is something I would wait to do for now.
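For reference, a minimal sketch of that mark-down/re-assert cycle (osd.94 is just an example taken from the acting sets above, and the exact log wording can vary between releases):

    ceph osd down 94
    # on osd.94's host, confirm it noticed and re-asserted itself:
    grep -i 'wrongly marked me down' /var/log/ceph/ceph-osd.94.log | tail -1
    # verify it reported back up to the mons:
    ceph osd tree | grep -w 'osd.94'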
> The reason the cluster doesn't know anything about the PG is that it's
> still creating and hasn't actually been created yet. Starting with some
> of the OSDs that you see with blocked requests would be a good idea.
> Eventually you'll mark down an OSD that, when it comes back up, makes
> things start looking much better as PGs peer. Below is the list of OSDs
> from your previous email; any that still have stuck requests will be good
> candidates to start doing this to. On closer review, it's almost all of
> them... but you have to start somewhere. Another possible place to start
> is to look at the list of all of the peering PGs and see if there are any
> common OSDs when you look at all of them at once. Some patterns may
> emerge and would be good options to try.
>
> osds 7,39,60,103,133 have stuck requests > 67108.9 sec
> osds 5,12,13,28,33,40,55,56,61,64,69,70,75,83,92,96,110,114,119,122,123,129,131 have stuck requests > 134218 sec
> osds 4,8,10,15,16,20,27,29,30,31,34,37,38,42,43,44,47,48,49,51,52,57,66,68,73,81,84,85,87,90,95,97,99,100,102,105,106,107,108,111,112,113,121,124,127,130,132 have stuck requests > 268435 sec
>
> On Fri, Feb 16, 2018 at 2:53 PM Bryan Banister <bbanis...@jumptrading.com> wrote:
>
> Thanks David,
>
> I have set the nobackfill, norecover, noscrub, and nodeep-scrub options
> at this point and the backfills have stopped. I'll also stop the backups
> from pushing into Ceph for now.
>
> I don't want to make things worse, so I'm asking for some more guidance
> now.
>
> 1) Looking at a PG that is still peering or one that is "unknown", Ceph
> complains that it doesn't have that pgid:
>
> pg 14.fb0 is stuck peering since forever, current state peering, last acting [94,30,38]
> [root@carf-ceph-osd03 ~]# ceph pg 14.fb0 query
> Error ENOENT: i don't have pgid 14.fb0
> [root@carf-ceph-osd03 ~]#
>
> 2) One that is activating shows this for the recovery_state:
>
> [root@carf-ceph-osd03 ~]# ceph pg 14.fe1 query | less
> [snip]
>     "recovery_state": [
>         {
>             "name": "Started/Primary/Active",
>             "enter_time": "2018-02-13 14:33:21.406919",
>             "might_have_unfound": [
>                 {
>                     "osd": "84(0)",
>                     "status": "not queried"
>                 }
>             ],
>             "recovery_progress": {
>                 "backfill_targets": [
>                     "56(0)",
>                     "87(1)",
>                     "88(2)"
>                 ],
>                 "waiting_on_backfill": [],
>                 "last_backfill_started": "MIN",
>                 "backfill_info": {
>                     "begin": "MIN",
>                     "end": "MIN",
>                     "objects": []
>                 },
>                 "peer_backfill_info": [],
>                 "backfills_in_flight": [],
>                 "recovering": [],
>                 "pg_backend": {
>                     "recovery_ops": [],
>                     "read_ops": []
>                 }
>             },
>             "scrub": {
>                 "scrubber.epoch_start": "0",
>                 "scrubber.active": false,
>                 "scrubber.state": "INACTIVE",
>                 "scrubber.start": "MIN",
>                 "scrubber.end": "MIN",
>                 "scrubber.subset_last_update": "0'0",
>                 "scrubber.deep": false,
>                 "scrubber.seed": 0,
>                 "scrubber.waiting_on": 0,
>                 "scrubber.waiting_on_whom": []
>             }
>         },
>         {
>             "name": "Started",
>             "enter_time": "2018-02-13 14:33:17.491148"
>         }
>     ],
>
> Sorry for all the hand holding, but how do I determine whether I need to
> set an OSD as 'down' to fix the issues, and how does it go about
> re-asserting itself?
>
> I again tried looking at the Ceph docs on troubleshooting OSDs but didn't
> find any details. The man page also has no details.
>
> Thanks again,
> -Bryan
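On David's suggestion of looking for common OSDs across all of the peering PGs at once, a rough counting pipeline (a sketch only; the field extraction assumes acting sets are printed as bracketed lists, as in the output above):

    # count how often each OSD appears in the acting sets of stuck-peering PGs
    ceph health detail | grep 'is stuck peering' \
      | grep -o '\[[0-9,]*\]' | tr -d '[]' | tr ',' '\n' \
      | sort -n | uniq -c | sort -rn | head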
> From: David Turner [mailto:drakonst...@gmail.com]
> Sent: Friday, February 16, 2018 1:21 PM
> To: Bryan Banister <bbanis...@jumptrading.com>
> Cc: Bryan Stillwell <bstillw...@godaddy.com>; Janne Johansson <icepic...@gmail.com>; Ceph Users <ceph-users@lists.ceph.com>
> Subject: Re: [ceph-users] Help rebalancing OSD usage, Luminous 12.2.2
>
> Your problem might have been creating too many PGs at once. I generally
> increase pg_num and pgp_num by no more than 256 at a time, making sure
> that all PGs are created, peered, and healthy (other than backfilling)
> before increasing them again.
>
> To help you get back to a healthy state, let's start off by getting all
> of your PGs peered. Go ahead and put a stop to backfilling, recovery,
> scrubbing, etc. Those are all hindering the peering effort right now.
> The more clients you can disable, the better.
>
> ceph osd set nobackfill
> ceph osd set norecover
> ceph osd set noscrub
> ceph osd set nodeep-scrub
>
> After that, look at your peering PGs and find out what is blocking their
> peering. This is where you might need to use `ceph osd down 23` (assuming
> you needed to kick osd.23) to mark an OSD down in the cluster and let it
> re-assert itself. Once all PGs are done peering, go ahead and unset
> nobackfill and norecover and let the cluster start moving data around.
> Leaving noscrub and nodeep-scrub set is optional and up to you. I'll
> never say it's better to leave scrubbing disabled, but scrubbing does use
> a fair bit of the spindles' time while you're trying to backfill.
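Once the peering PGs have drained (e.g. `ceph pg dump_stuck inactive` comes back empty), resuming data movement would look roughly like this sketch:

    ceph osd unset nobackfill
    ceph osd unset norecover
    # optionally, once backfill has settled down:
    ceph osd unset noscrub
    ceph osd unset nodeep-scrub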
> On Fri, Feb 16, 2018 at 2:12 PM Bryan Banister <bbanis...@jumptrading.com> wrote:
>
> Well, I decided to try the increase in PGs to 4096 and that seems to have
> caused some issues:
>
> 2018-02-16 12:38:35.798911 mon.carf-ceph-osd01 [ERR] overall HEALTH_ERR 61802168/241154376 objects misplaced (25.628%); Reduced data availability: 2081 pgs inactive, 322 pgs peering; Degraded data redundancy: 552/241154376 objects degraded (0.000%), 3099 pgs unclean, 38 pgs degraded; 163 stuck requests are blocked > 4096 sec
>
> The cluster is actively backfilling misplaced objects, but not all PGs
> are active at this point and many are stuck peering, stuck unclean, or
> have a state of unknown:
>
> PG_AVAILABILITY Reduced data availability: 2081 pgs inactive, 322 pgs peering
>     pg 14.fae is stuck inactive for 253360.025730, current state activating+remapped, last acting [85,12,41]
>     pg 14.faf is stuck inactive for 253368.511573, current state unknown, last acting []
>     pg 14.fb0 is stuck peering since forever, current state peering, last acting [94,30,38]
>     pg 14.fb1 is stuck inactive for 253362.605886, current state activating+remapped, last acting [6,74,34]
> [snip]
>
> The health also shows a large number of degraded data redundancy PGs:
>
> PG_DEGRADED Degraded data redundancy: 552/241154376 objects degraded (0.000%), 3099 pgs unclean, 38 pgs degraded
>     pg 14.fc7 is stuck unclean for 253368.511573, current state unknown, last acting []
>     pg 14.fc8 is stuck unclean for 531622.531271, current state active+remapped+backfill_wait, last acting [73,132,71]
>     pg 14.fca is stuck unclean for 420540.396199, current state active+remapped+backfill_wait, last acting [0,80,61]
>     pg 14.fcb is stuck unclean for 531622.421855, current state activating+remapped, last acting [70,26,75]
> [snip]
>
> We also now have a number of stuck requests:
>
> REQUEST_STUCK 163 stuck requests are blocked > 4096 sec
>     69 ops are blocked > 268435 sec
>     66 ops are blocked > 134218 sec
>     28 ops are blocked > 67108.9 sec
>     osds 7,39,60,103,133 have stuck requests > 67108.9 sec
>     osds 5,12,13,28,33,40,55,56,61,64,69,70,75,83,92,96,110,114,119,122,123,129,131 have stuck requests > 134218 sec
>     osds 4,8,10,15,16,20,27,29,30,31,34,37,38,42,43,44,47,48,49,51,52,57,66,68,73,81,84,85,87,90,95,97,99,100,102,105,106,107,108,111,112,113,121,124,127,130,132 have stuck requests > 268435 sec
>
> I tried looking through the mailing list archive for how to resolve the
> stuck requests, and it seems that restarting the OSDs is the right way?
>
> At this point we have just been watching the backfills run and see a
> steady but slow decrease in misplaced objects. When the cluster is idle,
> the overall OSD disk utilization is not too bad, at roughly 40% on the
> physical disks running these backfills.
>
> However, we still have our backups trying to push new images into the
> cluster. This worked OK for the first few days, but yesterday we started
> getting failure alerts. I checked the status of the RGW service and
> noticed that 2 of the 3 RGW civetweb servers were not responsive. I
> restarted the RGWs on the ones that appeared hung and that got them
> working for a while, but then the same condition recurred. The RGWs seem
> to have recovered on their own now, but again the cluster is idle and
> only the backfills are currently doing anything (that I can tell). I did
> see these log entries:
>
> 2018-02-15 16:46:07.541542 7fffe6c56700  1 heartbeat_map is_healthy 'RGWAsyncRadosProcessor::m_tp thread 0x7fffcec26700' had timed out after 600
> 2018-02-15 16:46:12.541613 7fffe6c56700  1 heartbeat_map is_healthy 'RGWAsyncRadosProcessor::m_tp thread 0x7fffdbc40700' had timed out after 600
> 2018-02-15 16:46:12.541629 7fffe6c56700  1 heartbeat_map is_healthy 'RGWAsyncRadosProcessor::m_tp thread 0x7fffcec26700' had timed out after 600
> 2018-02-15 16:46:17.541701 7fffe6c56700  1 heartbeat_map is_healthy 'RGWAsyncRadosProcessor::m_tp thread 0x7fffdbc40700' had timed out after 600
>
> At this point we do not know how to proceed with recovery efforts. I
> tried looking at the Ceph docs and mailing list archives but wasn't able
> to determine the right path forward here.
>
> Any help is appreciated,
> -Bryan
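On the RGW hangs: a minimal sketch of restarting a hung instance and watching for the heartbeat_map timeouts to recur (this assumes the default systemd unit naming of ceph-radosgw@rgw.<hostname>; adjust to match the actual instance name):

    systemctl restart ceph-radosgw@rgw.$(hostname -s)
    journalctl -u ceph-radosgw@rgw.$(hostname -s) -f | grep heartbeat_map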