On 15.08.2019 16:38, huxia...@horebdata.cn wrote:
Dear folks,

I have a Ceph cluster with replication 2: 3 nodes, each with 3 OSDs, running Luminous 12.2.12. A few days ago one OSD went down (the disk itself is still fine) due to a RocksDB crash. I tried to restart that OSD but it failed to start. I then tried to rebalance, but now a number of PGs are inconsistent.

What can I do to get the cluster working again?

Thanks a lot for helping me out.

Samuel

**********************************************************************************
# ceph -s
   cluster:
     id:     289e3afa-f188-49b0-9bea-1ab57cc2beb8
     health: HEALTH_ERR
             pauserd,pausewr,noout flag(s) set
             191444 scrub errors
             Possible data damage: 376 pgs inconsistent
   services:
     mon: 3 daemons, quorum horeb71,horeb72,horeb73
     mgr: horeb73(active), standbys: horeb71, horeb72
     osd: 9 osds: 8 up, 8 in
          flags pauserd,pausewr,noout
   data:
     pools:   1 pools, 1024 pgs
     objects: 524.29k objects, 1.99TiB
     usage:   3.67TiB used, 2.58TiB / 6.25TiB avail
     pgs:     645 active+clean
              376 active+clean+inconsistent
              3   active+clean+scrubbing+deep


That is a lot of inconsistent PGs. When you say replication = 2, do you mean min_size=2 as in size=3/min_size=2, or that you have size=2/min_size=1?

The reason I ask is that min_size=1 is a well-known way to get into lots of problems: a single disk can accept a write on its own, and that drive can die before the write has been recovered/backfilled to another copy.
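
You can check what the pool is currently set to with something like this (the pool name "mypool" below is just a placeholder, use your own):

# ceph osd pool get mypool size
# ceph osd pool get mypool min_size

or simply "# ceph osd pool ls detail", which prints size and min_size for every pool.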

If you have min_size=1, I would recommend setting min_size=2 as the first step, to avoid creating more inconsistencies while you troubleshoot. If you have the space for it in the cluster, you should also set size=3.
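
Roughly like this, again with "mypool" standing in for your actual pool name:

# ceph osd pool set mypool min_size 2
# ceph osd pool set mypool size 3

Be aware that raising size to 3 will trigger backfill to create the third copy, so expect extra recovery traffic for a while.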

If you run "# ceph health detail" you will get a list of the PGs that are inconsistent. Check whether one OSD is a repeat offender across those PGs, and then check that disk for issues: look at dmesg, the OSD's log, and the drive's SMART data.
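
Something along these lines should give you an overview (the pg id 1.2f3 and the pool name mypool are just example placeholders):

# ceph health detail | grep inconsistent
# rados list-inconsistent-pg mypool
# ceph pg map 1.2f3                        <- shows which osds hold that pg
# rados list-inconsistent-obj 1.2f3 --format=json-pretty

and on the osd host itself:

# dmesg | grep -i error
# smartctl -a /dev/sdX                     <- replace sdX with the right device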

You can try to repair the inconsistent PGs automagically by running "# ceph pg repair <pg id>", but make sure the hardware is good first.
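
For example:

# ceph pg repair 1.2f3

(take the real pg ids from "ceph health detail"; 1.2f3 is just an example). With 376 inconsistent PGs you can loop over them, roughly like this, assuming jq is installed and "mypool" is your pool:

for pg in $(rados list-inconsistent-pg mypool | jq -r '.[]'); do
    ceph pg repair "$pg"
done

But again: only do this once you are confident the underlying disks are healthy, and keep in mind that with only two copies the repair has less information to decide which copy is the good one.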


good luck
Ronny

