Perhaps run "iostat -xtcy <list of OSD devices> 5" on the OSD hosts to
see whether any of the drives show unusually high utilization despite low
IOPS/request counts?
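
For example, something along these lines (sda, sdb and nvme0n1 are just
placeholders; substitute the devices that actually back your OSDs, e.g. as
listed by "ceph-volume lvm list" on each host):

  # extended stats (-x), timestamps (-t), CPU usage (-c), skip the
  # since-boot summary (-y), sample every 5 seconds
  iostat -xtcy sda sdb nvme0n1 5

A device whose %util stays near 100% while r/s and w/s remain low would be
a good candidate for a closer look.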


On Tue, 6 Dec 2022 at 10:02, Boris Behrens <b...@kervyn.de> wrote:
>
> Hi Sven,
> I am searching really hard for defective hardware, but I am currently out
> of ideas:
> - checked Prometheus stats, but in all that data I don't know what to look
> for (OSD apply latency was very low at the mentioned point and went up to
> 40ms after all OSDs were restarted)
> - smartctl shows nothing
> - dmesg shows nothing
> - network data shows nothing
> - OSD and cluster logs show nothing
>
> If anybody has a good tip on what I can check, that would be awesome: a
> string to search for in the logs (I made a copy of that day's logs), or a
> tool to fire against the hardware. I am 100% out of ideas what it could be.
> Within a time frame of 20s, 2/3 of our OSDs went from "all fine" to "I am
> waiting for the replicas to do their work" (log message 'waiting for sub
> ops'), but there was no alert that any OSD had connection problems to other
> OSDs. Additionally, the cluster_network uses the same interface, switch and
> everything as the public_network; the only difference is the VLAN id (I plan
> to remove the cluster_network because it does not provide anything for us).
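>
> A quick way to narrow that down might be to count that message per OSD log
> (assuming the default /var/log/ceph log location) and check whether the
> affected OSDs share a host or a device:
>
>   # count 'waiting for sub ops' hits per OSD log, worst offenders first
>   grep -c "waiting for sub ops" /var/log/ceph/ceph-osd.*.log \
>     | sort -t: -k2 -nr | head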
>
> I am also planning to update all hosts from CentOS 7 to Ubuntu 20.04 (newer
> kernel, standardized OS config and so on).
>
> On Mon, 5 Dec 2022 at 14:24, Sven Kieske <s.kie...@mittwald.de> wrote:
>
> > On Sat, 2022-12-03 at 01:54 +0100, Boris Behrens wrote:
> > > Hi,
> > > maybe someone here can help me debug an issue we faced today.
> > >
> > > Today one of our clusters came to a grinding halt with 2/3 of our OSDs
> > > reporting slow ops.
> > > The only option to get it back to work quickly was to restart all OSD daemons.
> > >
> > > The cluster is an Octopus cluster with 150 enterprise SSD OSDs. The last
> > > work on the cluster: a node was synced in 4 days ago.
> > >
> > > The only health issue that was reported was SLOW_OPS. No slow pings
> > > on the networks. No restarting OSDs. Nothing.
> > >
> > > I was able to pin it down to a 20s timeframe, and I read ALL the logs in
> > > a 20-minute window around this issue.
> > >
> > > I haven't found any clues.
> > >
> > > Maybe someone encountered this in the past?
> >
> > Do you happen to run your RocksDB on a dedicated caching device (NVMe SSD)?
> >
> > I observed slow ops in Octopus after a faulty NVMe SSD was inserted in one
> > Ceph server.
> > As was said in other mails, try to isolate your root cause.
> >
> > Maybe the node added 4 days ago was the culprit here?
> >
> > We were able to pinpoint the NVMe by monitoring the slow OSDs; the
> > commonality in this case was that they all shared the same NVMe cache device.
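> >
> > One way to spot such a pattern is to list the OSDs reporting slow ops and
> > then look up the devices behind each of them, e.g. (osd id 12 is only an
> > example, and the metadata field names can vary a bit between releases):
> >
> >   # which OSDs currently report slow ops
> >   ceph health detail | grep -i slow
> >   # devices behind a given OSD, including a separate DB/WAL device
> >   ceph osd metadata 12 | grep -E '"devices"|db_devices'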
> >
> > You should always benchmark new hardware and perform burn-in tests IMHO,
> > even though that is not always possible due to environment constraints.
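> >
> > For burn-in, a single-depth sync write test with fio can expose such a
> > device quickly, e.g. (destructive, so only run it against a drive that is
> > not yet in use; /dev/nvme0n1 is a placeholder):
> >
> >   fio --name=burnin --filename=/dev/nvme0n1 --ioengine=libaio \
> >       --direct=1 --sync=1 --rw=randwrite --bs=4k --iodepth=1 \
> >       --numjobs=1 --runtime=300 --time_based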
> >
> > --
> > Mit freundlichen Grüßen / Regards
> >
> > Sven Kieske
> > Systementwickler / systems engineer
> >
> >
> > Mittwald CM Service GmbH & Co. KG
> > Königsberger Straße 4-6
> > 32339 Espelkamp
> >
> > Tel.: 05772 / 293-900
> > Fax: 05772 / 293-333
> >
> > https://www.mittwald.de
> >
> > Managing directors: Robert Meyer, Florian Jürgens
> >
> > Tax no.: 331/5721/1033, VAT ID: DE814773217, HRA 6640, Bad Oeynhausen local court
> > General partner: Robert Meyer Verwaltungs GmbH, HRB 13260, Bad Oeynhausen local court
> >
> > Information on data processing in the course of our business activities
> > pursuant to Art. 13-14 GDPR is available at www.mittwald.de/ds.
> >
> >
>
> --
> The "UTF-8 problems" self-help group will, as an exception, meet in the
> large hall this time.



-- 
May the most significant bit of your life be positive.
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
