> I rely on SMART for 2 things:

Bit of nomenclature clarification:

- SMART is a mechanism for interacting with storage devices, mostly but not
  exclusively reading status and metrics
- smartctl is a CLI utility
- smartd is a daemon
- smartmontools is a package that includes smartctl and smartd
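
If it helps to confirm what a given host actually has installed, a quick check
(a sketch only, assuming a Debian- or RHEL-family box; package layout can vary
by distro):

    # Debian/Ubuntu: list the binaries shipped by the smartmontools package
    dpkg -L smartmontools | grep -E '/(smartctl|smartd)$'

    # RHEL/Rocky/etc. equivalent
    rpm -ql smartmontools | grep -E '/(smartctl|smartd)$'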


> 1. Repeatedly sending me messages about nonfatal bad sectors that no one 
> seems to know how to correct for.

That sounds like you have smartd running and configured to send email?  I 
personally haven't found value in smartd; ymmv.
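
For context, those mails typically come from a directive along these lines in
smartd.conf (a sketch; the address is a placeholder, and the file lives at
/etc/smartd.conf or /etc/smartmontools/smartd.conf depending on the distro):

    # Monitor everything smartd can find, check health/attributes/self-test logs,
    # and mail alerts to the listed address; -M test sends a test mail at startup
    DEVICESCAN -a -m storage-alerts@example.com -M test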

Drives sometimes encounter failed writes that result in a grown defect and
remapping of the affected LBA.  When that LBA is written again, the write
generally succeeds.  Since Nautilus, the OSD will retry a certain number of
writes, and as a result we see far fewer inconsistent PGs than we used to.

When a drive reports grown errors, those are worth tracking.  Every drive has
some number of factory bad blocks that are remapped out of the box, and a
portion of individual HDDs will develop additional grown defects over their
lifetimes.  A few of these are not cause for alarm; a lot of them IMHO is cause
to replace the drive.  The threshold for "a lot" is not clear-cut, perhaps
somewhere in the 5-10 range?
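
A sketch of one way to pull those counters, assuming a SATA drive at /dev/sdc
and a SAS drive at /dev/sdd (attribute names vary a bit by vendor and firmware):

    # SATA: attribute 5 is the reallocated (grown) sector count;
    # pending / offline-uncorrectable sectors are also worth watching
    smartctl -A /dev/sdc | grep -Ei 'reallocat|pending|uncorrect'

    # SAS: the grown defect list is reported directly
    smartctl -a /dev/sdd | grep -i 'grown defect'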

SSDs will report reallocated blocks and in some cases the numbers of spares 
used/remaining.  It is worth tracking these and alerting if a drive is running 
short on spares or experiences a high rate of new reallocated blocks.
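
For example (a sketch, assuming a SATA SSD at /dev/sdc and an NVMe device at
/dev/nvme0; exact attribute names differ by vendor):

    # SATA SSD: reallocated blocks and remaining reserved/spare space
    smartctl -A /dev/sdc | grep -Ei 'reallocat|reserv'

    # NVMe: smartctl reports Available Spare, its threshold, and Percentage Used
    smartctl -a /dev/nvme0 | grep -Ei 'available spare|percentage used'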

> 
> 2. Not saying anything before a device crashes.
> 
> Yeah. But I run it anyway because you never know.

The overall health status?  Yeah that usually has limited utility.  I've seen 
plenty of drives that are clearly throwing errors yet report healthy.  Arguably 
a firmware bug.
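
That status is what `smartctl -H` returns, and it only reflects the drive's own
self-assessment, so it can say PASSED while the error counters tell a different
story:

    # Overall self-assessment only; often PASSED even on a drive logging errors
    smartctl -H /dev/sdc

    # The error log is usually more telling
    smartctl -l error /dev/sdc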

> The reported error is too far abstracted from the actual failure and I cannot 
> find anything about -22 as a SMART result code.

I'm pretty sure that's an errno number for launching the subprocess, nothing to 
do with SMART itself.  I'd check dmesg and /var/log/{messages,syslog} to see 
if anything was reported for that drive.  If the drive is SAS/SATA and hung off 
an LSI HBA, also try `storcli64 /c0 show termlog >/var/tmp/termlog.txt`
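
Something along these lines, assuming the device is still enumerated as sdc
(a flaky link, resets, or medium errors usually show up in the kernel log first):

    # Kernel messages mentioning the device or block-layer I/O errors
    dmesg -T | grep -iE 'sdc|blk_update_request|I/O error'

    # Or via the journal on systemd hosts
    journalctl -k --no-pager | grep -i sdc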


> *n*x errno 22 is EINVAL, which seems unlikely, but it is possible that smartd 
> got misconfigured.

Best configuration IMHO is to stop, disable, and mask the service.
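
i.e. something like the below, assuming systemd; note the unit is named
smartd.service on some distros and smartmontools.service on others:

    # Stop smartd now and prevent it from being started again, even as a dependency
    systemctl stop smartd
    systemctl disable smartd
    systemctl mask smartd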

> 
> Run smartctl -t long /dev/sdc to launch an out-of-band long test. When it is done, 
> use smartctl to report the results and see if anything is flagged.
> 
> On 8/21/25 09:10, Anthony D'Atri wrote:
>> 
>>> On Aug 21, 2025, at 4:07 AM, Miles Goodhew <c...@m0les.com> wrote:
>>> 
>>> Hi Robert,
>>>  I'm not an expert on the low-level details and "modern" Ceph, so I hope I 
>>> don't lead you on any wild goose chases, but I might at least give some 
>>> leads.
>>>  It seems odd that the metrics mention NVM/e - I'm guessing that it's just 
>>> a cross-product test and tries all tools on all devices.
>> Recent releases of smartctl pass through stats for NVMe devices via the 
>> nvme-cli command "nvme".  Whether it invokes that for all devices, ordering, 
>> etc I don't know.
>> 
>> 
>>> SMART test failure is more of an issue. It's a pity the error message is so 
>>> nondescript. Some things I can think of from simplest to most complicated 
>>> are:
>>> * Are smartmontools installed on the drive host?
>> Does it happen with other drives on the same host?
>> 
>> If you have availability through your chassis vendor, look for a firmware 
>> update.
>> 
>>> * Does the monitoring UID have sudo access?
>>> * Does a manual "sudo smartctl -a /dev/sdc" give the same or similar result?
>>> * Is the drive managed by a hardware RAID controller or concentrator (Like 
>>> Dell PERC or a USB adapter or something)
>>> * (This is a stretch) Is there an OSD for the drive that's given the "NVME" 
>>> class?
>>> 
>>> Hope that gives you something.
>>> 
>>> M0les.
>>> 
>>> 
>>> On Thu, 21 Aug 2025, at 17:15, Robert Sander wrote:
>>>> Hi,
>>>> 
>>>> On a new cluster with version 19.2.3 the device health metrics only show a 
>>>> smartctl error:
>>>> 
>>>> {
>>>>     "20250821-000313": {
>>>>         "dev": "/dev/sdc",
>>>>         "error": "smartctl failed",
>>>>         "nvme_smart_health_information_add_log_error": "nvme returned an 
>>>> error: sudo: exit status: 1",
>>>>         "nvme_smart_health_information_add_log_error_code": -22,
>>>>         "nvme_vendor": "ata",
>>>>         "smartctl_error_code": -22,
>>>>         "smartctl_output": "smartctl returned an error (1): stderr:\nsudo: 
>>>> exit status: 1\nstdout:\n"
>>>>     }
>>>> }
>>>> 
>>>> The device in question (like all the others in the cluster) is a Samsung 
>>>> MZ7L37T6 SATA SSD.
>>>> 
>>>> What is happening here?
>>>> 
>>>> Regards
>>>> -- 
>>>> Robert Sander
>>>> Linux Consultant
>>>> 
>>>> Heinlein Consulting GmbH
>>>> Schwedter Str. 8/9b, 10119 Berlin
>>>> 
>>>> https://www.heinlein-support.de
>>>> 
>>>> Tel: +49 30 405051 - 0
>>>> Fax: +49 30 405051 - 19
>>>> 
>>>> Amtsgericht Berlin-Charlottenburg - HRB 220009 B
>>>> Geschäftsführer: Peer Heinlein - Sitz: Berlin
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
