I then followed someone's guidance: I added 'mon compact on start = true' to the config and restarted one mon. That mon did not rejoin the cluster until I added two mons, deployed on virtual machines with SSDs, into the cluster.
And now the cluster is fine except for the PG status.
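
For reference, the change was roughly the following (a sketch only; the config path and systemd unit name are the common defaults and may differ on your deployment):

    # /etc/ceph/ceph.conf
    [mon]
    mon compact on start = true

    # restart a single monitor, e.g. mon.a
    systemctl restart ceph-mon@a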
Zhenshi Zhou <deader...@gmail.com> wrote on Thu, 29 Oct 2020 at 20:29:
> Hi,
>
> I was so anxious a few hours ago because the sst files were growing so fast
> that I didn't think the space on the mon servers could hold out.
>
> Let me tell it from the beginning. I have a cluster with OSDs deployed on
> SATA (7200 rpm) disks, 10T each OSD, and I used an EC pool for more space.
> I added new OSDs into the cluster last week and it had recovered well so
> far. After that, while the cluster was still recovering, I increased the
> pg_num. Besides that, the clients were still writing data to the cluster
> all the time.
>
> And the cluster became unhealthy last night. Some OSDs were down and one
> mon was down. Then I found the mon servers' root directories were running
> out of free space. The sst files in /var/lib/ceph/mon/ceph-xxx/store.db/
> were growing rapidly.
>
>
> Frank Schilder <fr...@dtu.dk> wrote on Thu, 29 Oct 2020 at 19:15:
>> I think you really need to sit down and explain the full story. Dropping
>> one-liners with new information will not work via e-mail.
>>
>> I have never heard of the problem you are facing, so you did something
>> that possibly no-one else has done before. Unless we know the full history
>> from the last time the cluster was health_ok until now, it will almost
>> certainly not be possible to figure out what is going on via e-mail.
>>
>> Usually, setting "norebalance" and "norecovery" should stop any recovery
>> IO and allow the PGs to peer. If they do not become active, something is
>> wrong and the information we have got so far does not give a clue what it
>> could be.
>>
>> Please post the output of "ceph health detail", "ceph osd pool stats" and
>> "ceph osd pool ls detail", and a log of actions and results since the
>> last health_ok status; maybe it gives a clue what is going on.
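>>
>> For reference, a sketch of those commands as typed at the CLI (note that
>> the "norecovery" flag is spelled "norecover" there):
>>
>>     # pause rebalancing and recovery IO so PGs can peer
>>     ceph osd set norebalance
>>     ceph osd set norecover
>>
>>     # diagnostics to post
>>     ceph health detail
>>     ceph osd pool stats
>>     ceph osd pool ls detail
>>
>>     # unset the flags once PGs are active again
>>     ceph osd unset norebalance
>>     ceph osd unset norecover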
>>
>> Best regards,
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> ________________________________________
>> From: Zhenshi Zhou <deader...@gmail.com>
>> Sent: 29 October 2020 09:44:14
>> To: Frank Schilder
>> Cc: ceph-users
>> Subject: Re: [ceph-users] monitor sst files continue growing
>>
>> I reset the pg_num after adding the OSDs; it made some PGs inactive (stuck
>> in the activating state).
>>
>> Frank Schilder <fr...@dtu.dk> wrote on Thu, 29 Oct 2020 at 15:56:
>> This does not explain incomplete and inactive PGs. Are you hitting
>> https://tracker.ceph.com/issues/46847 (see also the thread "Ceph does not
>> recover from OSD restart")? In that case, temporarily stopping and
>> restarting all new OSDs might help.
>>
>> Best regards,
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> ________________________________________
>> From: Zhenshi Zhou <deader...@gmail.com>
>> Sent: 29 October 2020 08:30:25
>> To: Frank Schilder
>> Cc: ceph-users
>> Subject: Re: [ceph-users] monitor sst files continue growing
>>
>> After adding the OSDs into the cluster, the recovery and backfill progress
>> has not finished yet.
>>
>> Zhenshi Zhou <deader...@gmail.com> wrote on Thu, 29 Oct 2020 at 15:29:
>> The MGR was stopped by me because it took too much memory.
>> As for the PG status, I added some OSDs to this cluster, and it
>>
>> Frank Schilder <fr...@dtu.dk> wrote on Thu, 29 Oct 2020 at 15:27:
>> Your problem is the overall cluster health. The MONs store cluster history
>> information that will be trimmed once the cluster reaches HEALTH_OK.
>> Restarting the MONs only makes things worse right now. The health status
>> is a mess: no MGR, a bunch of PGs inactive, etc. This is what you need to
>> resolve. How did your cluster end up like this?
>>
>> It looks like all OSDs are up and in. You need to find out
>>
>> - why there are inactive PGs
>> - why there are incomplete PGs
>>
>> This usually happens when OSDs go missing.
>>
>> Best regards,
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> ________________________________________
>> From: Zhenshi Zhou <deader...@gmail.com>
>> Sent: 29 October 2020 07:37:19
>> To: ceph-users
>> Subject: [ceph-users] monitor sst files continue growing
>>
>> Hi all,
>>
>> My cluster is in a bad state. The SST files in
>> /var/lib/ceph/mon/xxx/store.db keep growing, and the cluster claims the
>> mons are using a lot of disk space.
>>
>> I set "mon compact on start = true" and restarted one of the monitors, but
>> it started compacting and has been at it for a long time; it seems to have
>> no end.
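>>
>> For reference, a sketch of how the store size can be watched, and of an
>> online compaction the Ceph CLI can trigger without a restart (paths and
>> <id> are placeholders):
>>
>>     # size of the mon's RocksDB store (default path layout assumed)
>>     du -sh /var/lib/ceph/mon/ceph-<id>/store.db
>>
>>     # ask a running mon to compact its store
>>     ceph tell mon.<id> compact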
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io