Hi Janek,

I just checked, and we upgraded both HDD and SSD firmware to the versions released last month.

HDD firmware (DELL/Seagate 'ST16000NM006J'):
https://www.dell.com/support/home/en-us/drivers/driversdetails?driverid=xf65r&lwp=rt

NVMe firmware (DELL/Kioxia 'Dell Ent NVMe CM7 U.2 RI 1.92TB'):
https://www.dell.com/support/home/en-us/drivers/DriversDetails?driverID=VH1YP&lwp=rt

What models are your drives?

Regards,
Frédéric.

----- On 12 May 25, at 9:37, Janek Bevendorff janek.bevendo...@uni-weimar.de wrote:

> Hi all,
>
> Kernel is 6.8.0 (Ubuntu). The thermal settings on iDRAC are already
> quite high and we have an overall good cooling system, so that shouldn't
> cause any issues. Our cold aisle is around 20 °C.
>
> @Frédéric Do you have a link to the HDD firmware? I installed everything
> that's available in the Dell catalogue. Also, I don't know whether CPU
> usage is high during these lockups, since I cannot observe the host
> state when it happens. It's like the entire node goes down until either
> it recovers on its own or I do a hard reset.
>
> Janek
>
>
> On 09/05/2025 15:08, Anthony D'Atri wrote:
>> There are separate thermal, overall performance, and fan states in iDRAC.
>> I’ve found that I often have to bump up the default “fan offset” for more
>> cooling.
>>
>>> On May 9, 2025, at 8:51 AM, Frédéric Nass <frederic.n...@univ-lorraine.fr>
>>> wrote:
>>>
>>> Hi Janek,
>>>
>>> We just had a very similar issue with recent hardware (DELL R760xd) going
>>> nuts (100% CPU load) for like 10 to 20 minutes and OSDs being reported
>>> 'down' as not responding in time.
>>>
>>> Switching the CPU profile to HPC (High Performance Computing) and the
>>> thermal settings to Maximum Performance (or is it Optimized?) in BIOS, and
>>> upgrading HDD firmware to the latest one that was only available from
>>> DELL's website (not yet in OpenManage catalog) fixed it.
>>>
>>> Maybe you can give it a try.
>>>
>>> Regards,
>>> Frédéric.
>>>
>>> ----- On 9 May 25, at 9:02, Janek Bevendorff janek.bevendo...@uni-weimar.de
>>> wrote:
>>>
>>>> Hi, it's happening again.
>>>> I haven't fully upgraded the firmware on all hosts yet, but at least
>>>> on all MDS. I managed to finish the Ceph upgrade, but now I'm randomly
>>>> getting the soft lockups again (mostly, but not only) on the MDS hosts.
>>>>
>>>> Anything else I could check for?
>>>>
>>>> Janek
>>>>
>>>>
>>>> On 16/04/2025 17:38, Janek Bevendorff wrote:
>>>>> Yes, we have a mirror of the Dell Firmware catalogue, so the servers
>>>>> can check what they need. There are three updates in total: BIOS, NIC,
>>>>> and Lifecycle Controller.
>>>>>
>>>>> I hope the BIOS update fixes this.
>>>>>
>>>>>
>>>>> On 16/04/2025 17:16, Anthony D'Atri wrote:
>>>>>> Ack, I know the R730xd very well, mostly running Trusty and Luminous
>>>>>> at the time. BIOS updates inherently require a reboot. Check for
>>>>>> CPLD/SPLD as well, that changes very rarely but ISTR that this model
>>>>>> had at least one update after FCS.
>>>>>>
>>>>>>
>>>>>>> The servers our Ceph runs on are all R730xd machines.
>>>>>>>
>>>>>>> I checked the Dell repository manager and it looks like there is at
>>>>>>> least one BIOS update that's newer than what we've already
>>>>>>> installed, so I've updated our Firmware repository and will schedule
>>>>>>> the updates now. That's going to take a long while.
>>>>>>>
>>>>>>>
>>>>>>> On 16/04/2025 16:16, Anthony D'Atri wrote:
>>>>>>>> For whatever reason, in recent years I’ve seen these more often
>>>>>>>> with Dells than other systems. My first thought was that maybe you
>>>>>>>> were running an ancient kernel, but then I saw that you aren’t. Is
>>>>>>>> the kernel you’re running the stock one that comes with your
>>>>>>>> distribution? I’ve seen CPU reset events on R750s running an
>>>>>>>> elrepo kernel.
>>>>>>>>
>>>>>>>> I suspect that some code change may have tickled a latent issue
>>>>>>>> that perhaps you were fortunate to have not previously run into,
>>>>>>>> but this is entirely speculation.
>>>>>>>>
>>>>>>>>> On Apr 16, 2025, at 9:39 AM, Janek Bevendorff
>>>>>>>>> <janek.bevendo...@uni-weimar.de> wrote:
>>>>>>>>>
>>>>>>>>> Yes, they are older Dell PowerEdges. I have to check whether
>>>>>>>>> there's newer firmware, but we've been running Ceph for years
>>>>>>>>> without these problems.
>>>>>>>>>
>>>>>>>>> I checked the logs on the host on which I had a lockup just an
>>>>>>>>> hour ago, but there's nothing besides the expected hard reset
>>>>>>>>> messages. There are two older watchdog messages, but they are from
>>>>>>>>> March:
>>>>>>>>>
>>>>>>>>> --------------------------------------------------------------------------------
>>>>>>>>> SeqNumber    = 2089
>>>>>>>>> Message ID   = ASR0000
>>>>>>>>> Category     = System
>>>>>>>>> AgentID      = SEL
>>>>>>>>> Severity     = Critical
>>>>>>>>> Timestamp    = 2025-03-27 07:16:03
>>>>>>>>> Message      = The watchdog timer expired.
>>>>>>>>> RawEventData = 0x03,0x00,0x02,0x33,0xFB,0xE4,0x67,0x20,0x00,0x04,0x23,0x71,0x6F,0xC0,0x04,0xFF
>>>>>>>>> FQDD         = WatchdogTimer.iDRAC.1
>>>>>>>>> --------------------------------------------------------------------------------
>>>>>>>>> SeqNumber    = 2088
>>>>>>>>> Message ID   = ASR0000
>>>>>>>>> Category     = System
>>>>>>>>> AgentID      = SEL
>>>>>>>>> Severity     = Critical
>>>>>>>>> Timestamp    = 2025-03-27 07:06:41
>>>>>>>>> Message      = The watchdog timer expired.
>>>>>>>>> RawEventData = 0x02,0x00,0x02,0x01,0xF9,0xE4,0x67,0x20,0x00,0x04,0x23,0x71,0x6F,0xC0,0x04,0xFF
>>>>>>>>> FQDD         = WatchdogTimer.iDRAC.1
>>>>>>>>> --------------------------------------------------------------------------------
>>>>>>>>>
>>>>>>>>> I grepped the logs of another host where it happened, but couldn't
>>>>>>>>> find any watchdog messages there. I believe it's also unlikely
>>>>>>>>> that suddenly all MDS hosts (we have five active, five hot
>>>>>>>>> standbys, and one cold standby) start having hardware issues.
>>>>>>>>> I also ran a memtest on one of the hosts last week and couldn't
>>>>>>>>> find anything there either.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 16/04/2025 15:14, Anthony D'Atri wrote:
>>>>>>>>>> Curious, are your systems Dells? If so you might see some
>>>>>>>>>> improvement from running DSU to update all the firmware. It
>>>>>>>>>> might also be illuminating to run `racadm lclog view`
>>>>>>>>>>
>>>>>>>>>>> On Apr 16, 2025, at 8:32 AM, Janek Bevendorff
>>>>>>>>>>> <janek.bevendo...@uni-weimar.de> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> Since the latest Reef update I have the problem that some of my
>>>>>>>>>>> hosts suddenly go into a state where all CPUs are stuck in
>>>>>>>>>>> kernel mode, causing all daemons on that host to become
>>>>>>>>>>> unresponsive. When I connect to the IPMI console, I see a lot of
>>>>>>>>>>> messages like:
>>>>>>>>>>>
>>>>>>>>>>> watchdog: BUG: soft lockup - CPU#8 stuck for 47s! [cron:868840]
>>>>>>>>>>>
>>>>>>>>>>> (it's basically a list of all processes running on the machine).
>>>>>>>>>>>
>>>>>>>>>>> Usually, this resolves itself after several minutes, but
>>>>>>>>>>> sometimes I have to hard reset the host. When this happens, all
>>>>>>>>>>> daemons are marked as down and I cannot interact with the host
>>>>>>>>>>> at all. I don't know what causes this, but I think it happens
>>>>>>>>>>> primarily on the hosts where my MDS run and it seems to be
>>>>>>>>>>> triggered by events such as cluster rebalances, MDS restarts, or
>>>>>>>>>>> just randomly.
>>>>>>>>>>>
>>>>>>>>>>> I found a few reports about similar issues on the bug tracker
>>>>>>>>>>> and mailing list, but they are all very unspecific, unanswered,
>>>>>>>>>>> or more than 6 years old.
>>>>>>>>>>>
>>>>>>>>>>> Is there any way I can debug this? I upgraded to Squid already,
>>>>>>>>>>> but that didn't solve the problem. I also had massive issues
>>>>>>>>>>> with this during the upgrade.
>>>>>>>>>>> Particularly at the end when the MDS were upgraded, I had
>>>>>>>>>>> constant struggles with it. I had to set the noout flag and
>>>>>>>>>>> then literally sit next to it to resume the upgrade every few
>>>>>>>>>>> minutes until it finally went through, because random MDS hosts
>>>>>>>>>>> went intermittently dark all the time.
>>>>>>>>>>>
>>>>>>>>>>> All hosts run Ubuntu 22.04 with kernel 6.8.0.
>>>>>>>>>>>
>>>>>>>>>>> Any ideas? Thanks!
>>>>>>>>>>> Janek
>>>>>>>>>>>
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> ceph-users mailing list -- ceph-users@ceph.io
>>>>>>>>>>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>>>>>> --
>>>>>>> Bauhaus-Universität Weimar
>>>>>>> Bauhausstr. 9a, R308
>>>>>>> 99423 Weimar, Germany
>>>>>>>
>>>>>>> Phone: +49 3643 58 3577
>>>>>>> www.webis.de
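
On the "is there any way I can debug this?" question from the original post: one low-effort starting point is to tally the soft lockup messages by CPU and by process, to see whether the stalls cluster on particular cores or tasks. Below is a minimal Python sketch, assuming you have kernel log text saved somewhere (for example from `journalctl -k` or a capture of the IPMI console). The function name `summarize_soft_lockups` is hypothetical, and the second sample line is invented for illustration; only the first one comes from this thread.

```python
import re
from collections import Counter

# Matches kernel soft lockup lines such as the one quoted in this thread:
#   watchdog: BUG: soft lockup - CPU#8 stuck for 47s! [cron:868840]
# The task name may itself contain colons (e.g. "kworker/8:1"), so the
# greedy name group backtracks to the last colon before the numeric PID.
SOFT_LOCKUP = re.compile(
    r"watchdog: BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^\]]*):(?P<pid>\d+)\]"
)


def summarize_soft_lockups(log_text):
    """Return (per-CPU counts, per-process counts, longest stall in seconds)."""
    per_cpu = Counter()
    per_comm = Counter()
    longest = 0
    for m in SOFT_LOCKUP.finditer(log_text):
        per_cpu[int(m["cpu"])] += 1
        per_comm[m["comm"]] += 1
        longest = max(longest, int(m["secs"]))
    return per_cpu, per_comm, longest


if __name__ == "__main__":
    # First line is from the thread; the second is a made-up example.
    sample = (
        "watchdog: BUG: soft lockup - CPU#8 stuck for 47s! [cron:868840]\n"
        "watchdog: BUG: soft lockup - CPU#8 stuck for 73s! [kworker/8:1:1234]\n"
    )
    per_cpu, per_comm, longest = summarize_soft_lockups(sample)
    print("lockups per CPU:", dict(per_cpu))
    print("lockups per process:", dict(per_comm))
    print("longest reported stall (s):", longest)
```

This only summarizes what the kernel already printed. It does not identify a root cause, and it finds nothing if the host locked up before the messages reached persistent storage.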