Hi Tyler,

We do have a monitoring system visualized with Grafana. I have already checked CPU iowait, load average, and CPU usage (user + system); none of them are high, so I don't think the problem is a lack of resources.
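For reference, the host-side spot checks you suggest, plus a look at the OSD's admin socket, would look roughly like this on the node carrying the acting primary (osd.130 is just the current primary from the pg output quoted below; the exact OSD id and PID will differ per deployment):

    # Disk-side view: per-device utilization and latency (sysstat package)
    iostat -x 1 5

    # Find the PID of the slow OSD's daemon
    pgrep -af ceph-osd

    # CPU-side view: which symbols are hot inside that process
    perf top -p <pid>

    # What the OSD itself is working on, via its admin socket
    ceph daemon osd.130 dump_ops_in_flight
    ceph daemon osd.130 dump_historic_slow_ops

Note that the 'ceph daemon' commands have to be run on the host where that OSD daemon is running.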
Regards,

On Tue, Oct 7, 2025 at 6:53 AM Tyler Stachecki <[email protected]> wrote:

> On Mon, Oct 6, 2025, 5:34 PM Sa Pham <[email protected]> wrote:
>
>> Hi Frédéric,
>>
>> I have tried restarting every OSD related to that PG many times, but it
>> didn't help.
>>
>> When the primary switched to OSD 101, we still saw slow requests on
>> OSD 101, so I don't think it is a hardware issue.
>>
>> Regards,
>>
>> On Mon, 6 Oct 2025 at 20:46 Frédéric Nass <[email protected]> wrote:
>>
>>> Could be an issue with the primary OSD, which is now osd.130. Have you
>>> checked osd.130 for any errors?
>>> Maybe try restarting osd.130 and osd.302 one at a time, and maybe 101
>>> as well, waiting for ~all PGs to become active+clean between restarts.
>>>
>>> Could you please share a ceph status, so we get a better view of the
>>> situation?
>>>
>>> Regards,
>>> Frédéric
>>>
>>> --
>>> Frédéric Nass
>>> Ceph Ambassador France | Senior Ceph Engineer @ CLYSO
>>> Try our Ceph Analyzer -- https://analyzer.clyso.com/
>>> https://clyso.com | [email protected]
>>>
>>> On Mon, Oct 6, 2025 at 14:19, Sa Pham <[email protected]> wrote:
>>>
>>>> Hi Frédéric,
>>>>
>>>> I tried to repeer and deep-scrub, but neither helped.
>>>>
>>>> "Have you already checked the logs for osd.302 and /var/log/messages
>>>> for any I/O-related issues?"
>>>>
>>>> => I checked; there are no I/O errors or issues.
>>>>
>>>> Regards,
>>>>
>>>> On Mon, Oct 6, 2025 at 3:15 PM Frédéric Nass <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi Sa,
>>>>>
>>>>> Regarding the output you provided, it appears that osd.302 is listed
>>>>> as UP but not ACTING for PG 18.773:
>>>>>
>>>>> PG_STAT  STATE                                             UP             UP_PRIMARY  ACTING     ACTING_PRIMARY
>>>>> 18.773   active+undersized+degraded+remapped+backfilling  [302,150,138]  302         [130,101]  130
>>>>>
>>>>> Have you already checked the logs for osd.302 and /var/log/messages
>>>>> for any I/O-related issues? Could you also try running
>>>>> 'ceph pg repeer 18.773'?
>>>>>
>>>>> If this is the only PG for which osd.302 is not acting and the
>>>>> 'repeer' command does not resolve the issue, I would suggest
>>>>> attempting a deep-scrub on this PG. This might uncover errors that
>>>>> could potentially be fixed, either online or offline.
>>>>>
>>>>> Regards,
>>>>> Frédéric
>>>>>
>>>>> On Mon, Oct 6, 2025 at 06:31, Sa Pham <[email protected]> wrote:
>>>>>
>>>>>> Hello Eugen,
>>>>>>
>>>>>> This PG contains 254490 objects, total size 68095493667 bytes
>>>>>> (about 63 GiB).
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> On Fri, Oct 3, 2025 at 9:10 PM Eugen Block <[email protected]> wrote:
>>>>>>
>>>>>>> Is it possible that this is a huge PG? What size does it have? But
>>>>>>> it could also be a faulty disk.
>>>>>>>
>>>>>>> Quoting Sa Pham <[email protected]>:
>>>>>>>
>>>>>>>> Hello everyone,
>>>>>>>>
>>>>>>>> I'm running a Ceph cluster used as an RGW backend, and I'm facing
>>>>>>>> an issue with one particular placement group (PG).
>>>>>>>>
>>>>>>>> - Accessing objects from this PG is *extremely slow*.
>>>>>>>> - Even running ceph pg <pg_id> takes a very long time.
>>>>>>>> - The PG is currently *stuck in a degraded state*, so I'm unable
>>>>>>>>   to move it to other OSDs.
>>>>>>>>
>>>>>>>> The current Ceph version is Reef 18.2.7.
>>>>>>>>
>>>>>>>> Has anyone encountered a similar issue before, or does anyone have
>>>>>>>> suggestions on how to troubleshoot and resolve it?
>>>>>>>>
>>>>>>>> Thanks in advance!
>
> Hi,
>
> If it's a performance issue, particularly if it's manifesting as high CPU
> load, you can usually pinpoint what's going on based on which symbol(s)
> are hot according to `perf top -p <pid>`.
>
> If it's not CPU hot, `iostat` is worth a look to see if the kernel thinks
> the block device is busy.
>
> Barring either of those it gets a bit trickier to tease out, but my
> advice would be to first discern whether or not it's a resource issue and
> work backwards from there.
>
> Cheers,
> Tyler

--
Sa Pham Dang
Skype: great_bn
Phone/Telegram: 0986.849.582
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
