Hi Janek,

I just checked: we upgraded both the HDD and SSD firmware to the versions 
released last month.

HDD firmware (DELL/Seagate 'ST16000NM006J'): 
https://www.dell.com/support/home/en-us/drivers/driversdetails?driverid=xf65r&lwp=rt
NVMe firmware (DELL/Kioxia 'Dell Ent NVMe CM7 U.2 RI 1.92TB'): 
https://www.dell.com/support/home/en-us/drivers/DriversDetails?driverID=VH1YP&lwp=rt

What models are your drives?

Regards,
Frédéric.
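
In case it helps with comparing, smartmontools can emit drive identity as JSON 
via `smartctl -i -j /dev/sdX`; here's a minimal sketch that pulls model and 
firmware out of that output (the sample values below are made up for 
illustration, not taken from real drives):

```python
import json

def parse_drive_info(smartctl_json: str) -> dict:
    """Extract model and firmware from `smartctl -i -j /dev/sdX` output."""
    info = json.loads(smartctl_json)
    return {
        "model": info.get("model_name", "unknown"),
        "firmware": info.get("firmware_version", "unknown"),
    }

# Abridged sample of smartctl's JSON mode output; "SN04" is a made-up
# firmware string, only the key names come from smartmontools.
sample = '{"model_name": "ST16000NM006J", "firmware_version": "SN04"}'
print(parse_drive_info(sample))
```

Running that over each device from `smartctl --scan -j` gives a quick 
inventory to check against Dell's support pages.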

----- On 12 May 25, at 9:37, Janek Bevendorff janek.bevendo...@uni-weimar.de 
wrote:

> Hi all,
> 
> Kernel is 6.8.0 (Ubuntu). The thermal settings on iDrac are already
> quite high and we have an overall good cooling system, so that shouldn't
> cause any issues. Our cold aisle is around 20˚C.
> 
> @Frédéric Do you have a link to the HDD firmware? I installed everything
> that's available in the Dell catalogue. Also, I don't know whether CPU
> usage is high during these lockups, since I cannot observe the host
> state when it happens. It's like the entire node goes down until either
> it recovers on its own or I do a hard reset.
> 
> Janek
> 
> 
> On 09/05/2025 15:08, Anthony D'Atri wrote:
>> There are separate thermal, overall performance, and fan states in iDRAC.  
>> I’ve
>> found that I often have to bump up the default “fan offset” for more cooling.
>>
>>> On May 9, 2025, at 8:51 AM, Frédéric Nass <frederic.n...@univ-lorraine.fr>
>>> wrote:
>>>
>>> Hi Janek,
>>>
>>> We just had a very similar issue with recent hardware (DELL R760xd) going
>>> nuts (100% CPU load) for 10 to 20 minutes, with OSDs being reported 'down'
>>> for not responding in time.
>>>
>>> Switching the CPU profile to HPC (High Performance Computing) and the
>>> thermal settings to Maximum Performance (or is it Optimized?) in the BIOS,
>>> and upgrading the HDD firmware to the latest version that was only
>>> available from DELL's website (not yet in the OpenManage catalog) fixed it.
>>>
>>> Maybe you can give it a try.
>>>
>>> Regards,
>>> Frédéric.
>>>
>>> ----- On 9 May 25, at 9:02, Janek Bevendorff janek.bevendo...@uni-weimar.de
>>> wrote:
>>>
>>>> Hi, it's happening again. I haven't fully upgraded the firmware on all
>>>> hosts yet, but at least on all MDS. I managed to finish the Ceph
>>>> upgrade, but now I'm randomly getting the soft lockups again (mostly,
>>>> but not only) on the MDS hosts.
>>>>
>>>> Anything else I could check for?
>>>>
>>>> Janek
>>>>
>>>>
>>>> On 16/04/2025 17:38, Janek Bevendorff wrote:
>>>>> Yes, we have a mirror of the Dell Firmware catalogue, so the servers
>>>>> can check what they need. There are three updates in total: BIOS, NIC,
>>>>> and Lifecycle Controller.
>>>>>
>>>>> I hope the BIOS update fixes this.
>>>>>
>>>>>
>>>>> On 16/04/2025 17:16, Anthony D'Atri wrote:
>>>>>> Ack, I know the R730xd very well, mostly running Trusty and Luminous
>>>>>> at the time.  BIOS updates inherently require a reboot.  Check for
>>>>>> CPLD/SPLD as well; that changes very rarely, but ISTR that this model
>>>>>> had at least one update after FCS.
>>>>>>
>>>>>>
>>>>>>> The servers our Ceph runs on are all R730xd machines.
>>>>>>>
>>>>>>> I checked the Dell repository manager and it looks like there is at
>>>>>>> least one BIOS update that's newer than what we've already
>>>>>>> installed, so I've updated our Firmware repository and will schedule
>>>>>>> the updates now. That's going to take a long while.
>>>>>>>
>>>>>>>
>>>>>>> On 16/04/2025 16:16, Anthony D'Atri wrote:
>>>>>>>> For whatever reason, in recent years I’ve seen these more often
>>>>>>>> with Dells than other systems. My first thought was that maybe you
>>>>>>>> were running an ancient kernel, but then I saw that you aren’t.  Is
>>>>>>>> the kernel you’re running the stock one that comes with your
>>>>>>>> distribution?  I’ve seen CPU reset events on R750s running an
>>>>>>>> elrepo kernel.
>>>>>>>>
>>>>>>>> I suspect that some code change may have tickled a latent issue
>>>>>>>> that perhaps you were fortunate to have not previously run into,
>>>>>>>> but this is entirely speculation.
>>>>>>>>
>>>>>>>>> On Apr 16, 2025, at 9:39 AM, Janek Bevendorff
>>>>>>>>> <janek.bevendo...@uni-weimar.de> wrote:
>>>>>>>>>
>>>>>>>>> Yes, they are older Dell PowerEdges. I have to check whether
>>>>>>>>> there's newer firmware, but we've been running Ceph for years
>>>>>>>>> without these problems.
>>>>>>>>>
>>>>>>>>> I checked the logs on the host on which I had a lockup just an
>>>>>>>>> hour ago, but there's nothing besides the expected hard reset
>>>>>>>>> messages. There are two older watchdog messages, but they are from
>>>>>>>>> March:
>>>>>>>>>
>>>>>>>>> --------------------------------------------------------------------------------
>>>>>>>>>
>>>>>>>>> SeqNumber       = 2089
>>>>>>>>> Message ID      = ASR0000
>>>>>>>>> Category        = System
>>>>>>>>> AgentID         = SEL
>>>>>>>>> Severity        = Critical
>>>>>>>>> Timestamp       = 2025-03-27 07:16:03
>>>>>>>>> Message         = The watchdog timer expired.
>>>>>>>>> RawEventData    =
>>>>>>>>> 0x03,0x00,0x02,0x33,0xFB,0xE4,0x67,0x20,0x00,0x04,0x23,0x71,0x6F,0xC0,0x04,0xFF
>>>>>>>>>
>>>>>>>>> FQDD            = WatchdogTimer.iDRAC.1
>>>>>>>>> --------------------------------------------------------------------------------
>>>>>>>>>
>>>>>>>>> SeqNumber       = 2088
>>>>>>>>> Message ID      = ASR0000
>>>>>>>>> Category        = System
>>>>>>>>> AgentID         = SEL
>>>>>>>>> Severity        = Critical
>>>>>>>>> Timestamp       = 2025-03-27 07:06:41
>>>>>>>>> Message         = The watchdog timer expired.
>>>>>>>>> RawEventData    =
>>>>>>>>> 0x02,0x00,0x02,0x01,0xF9,0xE4,0x67,0x20,0x00,0x04,0x23,0x71,0x6F,0xC0,0x04,0xFF
>>>>>>>>>
>>>>>>>>> FQDD            = WatchdogTimer.iDRAC.1
>>>>>>>>> --------------------------------------------------------------------------------
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I grepped the logs of another host where it happened, but couldn't
>>>>>>>>> find any watchdog messages there. I believe it's also unlikely
>>>>>>>>> that suddenly all MDS hosts (we have five active, five hot
>>>>>>>>> standbys, and one cold standby) start having hardware issues. I
>>>>>>>>> also ran a memtest on one of the hosts last week and couldn't find
>>>>>>>>> anything there either.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 16/04/2025 15:14, Anthony D'Atri wrote:
>>>>>>>>>> Curious, are your systems Dells? If so you might see some
>>>>>>>>>> improvement from running DSU to update all the firmware.  It
>>>>>>>>>> might also be illuminating to run `racadm lclog view`
>>>>>>>>>>
>>>>>>>>>>> On Apr 16, 2025, at 8:32 AM, Janek Bevendorff
>>>>>>>>>>> <janek.bevendo...@uni-weimar.de> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> Since the latest Reef update I have the problem that some of my
>>>>>>>>>>> hosts suddenly go into a state where all CPUs are stuck in
>>>>>>>>>>> kernel mode causing all daemons on that host to become
>>>>>>>>>>> unresponsive. When I connect to the IPMI console, I see a lot of
>>>>>>>>>>> messages like:
>>>>>>>>>>>
>>>>>>>>>>> watchdog: BUG: soft lockup - CPU#8 stuck for 47s! [cron:868840]
>>>>>>>>>>>
>>>>>>>>>>> (it's basically a list of all processes running on the machine).
>>>>>>>>>>>
>>>>>>>>>>> Usually, this resolves itself after several minutes, but
>>>>>>>>>>> sometimes I have to hard-reset the host. When this happens, all
>>>>>>>>>>> daemons are marked as down and I cannot interact with the host
>>>>>>>>>>> at all. I don't know what causes this, but I think it happens
>>>>>>>>>>> primarily on the hosts where my MDS run, and it seems to be
>>>>>>>>>>> triggered by events such as cluster rebalances or MDS restarts,
>>>>>>>>>>> though sometimes it strikes at random.
>>>>>>>>>>>
>>>>>>>>>>> I found a few reports about similar issues on the bug tracker
>>>>>>>>>>> and mailing list, but they are all very unspecific, unanswered,
>>>>>>>>>>> or more than 6 years old.
>>>>>>>>>>>
>>>>>>>>>>> Is there any way I can debug this? I upgraded to Squid already,
>>>>>>>>>>> but that didn't solve the problem. I also had massive issues
>>>>>>>>>>> with this during the upgrade, particularly at the end when the
>>>>>>>>>>> MDS were upgraded. I had to set the noout flag and then literally
>>>>>>>>>>> sit next to it, resuming the upgrade every few minutes until it
>>>>>>>>>>> finally went through, because random MDS hosts kept going dark
>>>>>>>>>>> intermittently.
>>>>>>>>>>>
>>>>>>>>>>> All hosts run Ubuntu 22.04 with kernel 6.8.0.
>>>>>>>>>>>
>>>>>>>>>>> Any ideas? Thanks!
>>>>>>>>>>> Janek
>>>>>>>>>>>
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> ceph-users mailing list -- ceph-users@ceph.io
>>>>>>>>>>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>>>>>> --
>>>>>>> Bauhaus-Universität Weimar
>>>>>>> Bauhausstr. 9a, R308
>>>>>>> 99423 Weimar, Germany
>>>>>>>
>>>>>>> Phone: +49 3643 58 3577
>>>>>>> www.webis.de
>>>>>>>
>>>>>>>