Tracker issue filed here with some additional details:
https://tracker.ceph.com/issues/71501

Cluster version 18.2.4

I came to assist with a non-functional cluster in which OSDs had been
erroneously purged with --force, leaving multiple (6) degraded+inactive PGs
(4 remaining shards in a 4+2 EC pool) and 1 remapped+incomplete PG (3
remaining shards in a 4+2).

In an effort to restore order to the cluster and get backfill working, the
OSD that the degraded PGs shared as their common primary was restarted.
Additionally, min_size was dropped from 5 to 4 for this pool (a temporary
measure while the cluster recovered).
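
For reference, the min_size change was just the usual pool setting, along
these lines (the pool name is a placeholder):

    # temporarily allow the 4+2 EC pool to go active with only k=4 shards present
    ceph osd pool set <ec-pool> min_size 4

    # verify, and remember to raise it back to 5 once recovery finishes
    ceph osd pool get <ec-pool> min_size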

This caused the inactive degraded PGs to go active and start their
backfill. The PGs steadily worked on backfill for a few hours.

Almost immediately after the last of the 6 degraded PGs finished its
backfill, the monitor quorum broke and the cluster became unresponsive. In
this state, 2 of the 5 mons showed 100% CPU usage.

In an attempt to fix the mon quorum, the monitor services were restarted in
various combinations, and ultimately all Ceph services were brought down in
order to isolate the MON issue.

As the situation stands, starting any combination of 3 MONs (the minimum
needed for quorum) causes the lowest-ranked MON (the to-be-leader) to peg
100% CPU in its fn_monstore thread and renders the admin socket of that mon,
and only that mon, unresponsive (the other 2 respond via their admin sockets
and report either probing or electing).
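
For anyone following along, the per-host checks are along these lines (the
mon ID is a placeholder):

    # admin socket state of the local mon (this hangs on the to-be-leader)
    ceph daemon mon.<id> mon_status

    # per-thread CPU usage of the mon process; fn_monstore is the pegged thread
    top -H -p $(pidof ceph-mon)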

I have captured logs of the to-be-leader mon (its daemon was started first)
with debug_mon = 20 and debug_ms = 20 (I should probably recapture with
debug_paxos = 20 as well). The pegged CPU only occurs once the third MON is
started, which suggests an election issue. The MON has been allowed to run
for hours in that state with no progress. Eventually the leader does seem to
enter the (leader) state (it claims "I win" in the logs), but the other 2
mons continue their election cycle, still probing or electing.
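
Since the admin socket on that mon is unresponsive, the debug levels can be
set in ceph.conf on its host before the daemon is started, e.g. (debug_paxos
included for the planned recapture):

    # on the to-be-leader's host, in /etc/ceph/ceph.conf
    [mon]
        debug_mon = 20
        debug_ms = 20
        debug_paxos = 20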

I have verified that NTP is fine and synced to the same source on all the
mons, and that connectivity between the mons on both messenger ports
(v2/3300 and v1/6789) is functional. The situation the PGs were in leading
up to the MON fault makes me believe the problem is more complex than a
typical monitor election issue. Also of note: the backing disk of the
to-be-leader MON is mostly idle.
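
The clock and connectivity checks were nothing exotic; roughly the following
on each mon host (chrony shown as an assumption, ntpq -p if running classic
ntpd; the peer hostname is a placeholder):

    # confirm the clock is synced and to which source
    chronyc sources -v

    # confirm both messenger ports are reachable on each peer mon
    nc -vz <peer-mon-host> 3300
    nc -vz <peer-mon-host> 6789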

At this point I am interested in taking backups of all the mon stores and
injecting a modified single-mon monmap into one mon to see if we can get
back up (rough sketch below), but I am also concerned that that single mon
will then be the de-facto leader and be just as unresponsive. Interested in
any suggestions from the wider community. Thanks!
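
The procedure I have in mind is roughly the standard monmap surgery (mon IDs
and paths are placeholders, and the paths assume a package-based rather than
cephadm deployment):

    # stop the chosen mon and back up every mon store before touching anything
    systemctl stop ceph-mon@<id>
    cp -a /var/lib/ceph/mon/ceph-<id> /root/mon.<id>.backup.$(date +%F)

    # extract the current monmap from the stopped mon's store
    ceph-mon -i <id> --extract-monmap /tmp/monmap

    # inspect it, then drop each of the other four monitors
    monmaptool --print /tmp/monmap
    monmaptool /tmp/monmap --rm <other-mon-id>   # repeat for each of the other mons

    # inject the single-mon map and start only this mon
    ceph-mon -i <id> --inject-monmap /tmp/monmap
    systemctl start ceph-mon@<id>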


Respectfully,

*Wes Dillingham*
LinkedIn <http://www.linkedin.com/in/wesleydillingham>
w...@wesdillingham.com
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

Reply via email to