> 
> As far as I know, the hardware sector remap has no standard utility
> available to add failing sectors and the badblock list is something
> that would have to be used by the specific filesystem driver in
> question.

Modern drives generally do this for us, unlike the C/H/S days of e.g. SMD, when
one would manually slip a newly discovered bad sector into the drive's defect list.
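
For what it's worth, that remap activity still shows up in SMART if you want to
watch it.  Attribute names vary by vendor, and SAS drives report a grown defect
list rather than attribute 5, but something like this (adjust the device name)
is usually enough:

  smartctl -A /dev/sdX | egrep -i 'realloc|pending|uncorrect'   # SATA/ATA
  smartctl -a /dev/sdX | grep -i 'grown defect'                 # SAS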

> 
> By all appearances, SMART continues to test all sectors, regardless of
> OS usage, since it runs internal to the drive itself and thus knows
> nothing of the OS. So it would flag sectors that have been walled off
> from actual use.

Same with an HBA patrol read.
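
If the controller is a Broadcom/LSI RoC and storcli is installed, you can check
or kick one off manually, e.g. (assuming controller 0):

  storcli64 /c0 show patrolread
  storcli64 /c0 start patrolread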

> 
> So far, the only SMART errors I've been annoyed by have been in OS
> partitions and never on an OSD, so I don't know just how closely SMART
> and Ceph interact.
> 
> My bad sector messages (usually no more than 3 for a given drive)
> typically run for up to 3 years without further degradation and since I
> have both robust data storage (e.g., Ceph) and robust backups, I'm
> happy to run drives into the ground. It's not like I'm being paid a
> handsome sum to keep everything flawless.

I agree that 3 grown defects is nothing to get excited about.  Maybe an action 
threshold would be 1 per TB?
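
If one wanted to automate a per-TB threshold, a rough sketch (assuming a SATA
drive that exposes Reallocated_Sector_Ct; attribute names and field positions
vary by vendor, so treat this as a starting point, not gospel):

  SIZE_TB=$(lsblk -bdno SIZE /dev/sdc | awk '{printf "%.1f", $1/1e12}')
  DEFECTS=$(smartctl -A /dev/sdc | awk '/Reallocated_Sector_Ct/ {print $10}')
  echo "${DEFECTS:-0} grown defects across ${SIZE_TB} TB"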

> 
>   Tim
> 
> On Thu, 2025-08-21 at 16:04 -0400, Anthony D'Atri wrote:
>> 
>>> I rely on SMART for 2 things:
>> 
>> Bit of nomenclature clarification:
>> 
>> * SMART is a mechanism for interacting with storage devices, mostly
>>   but not exclusively for reading status and metrics
>> * smartctl is a CLI utility
>> * smartd is a daemon
>> * smartmontools is a package that includes smartctl and smartd
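
A quick way to keep those straight on a given host (device name is just an
example):

  smartctl --scan              # devices smartctl can see
  smartctl -i /dev/sdX         # identify one device
  systemctl status smartd      # the daemon; the unit may be named
                               # smartmontools on Debian/Ubuntu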
>> 
>> 
>>> 1. Repeatedly sending me messages about nonfatal bad sectors that
>>> no one seems to know how to correct for.
>> 
>> That sounds like you have smartd running and configured to send
>> email?  I personally haven't found value in smartd; ymmv.
>> 
>> Drives sometimes encounter failed writes that result in a grown
>> defect and remapping of the affected LBA.  When that LBA is written
>> again, it generally succeeds.  Since Nautilus the OSD will retry a
>> certain number of writes, and as a result we see a lot fewer
>> inconsistent PGs than we used to.
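
When one does show up, the usual sequence is along the lines of:

  ceph health detail | grep inconsistent
  rados list-inconsistent-obj <pgid> --format=json-pretty
  ceph pg repair <pgid>

though it's worth understanding why the PG went inconsistent before repairing.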
>> 
>> When a drive reports grown errors, those are worth tracking.  Every
>> drive ships with some number of factory bad blocks that are remapped
>> out of the box.  A portion of HDDs will develop additional grown
>> defects over their lifetime.  A few of these are not cause for alarm;
>> a lot of them IMHO is cause to replace the drive.  The threshold for
>> "a lot" is not clear, perhaps somewhere in the 5-10 range?
>> 
>> SSDs will report reallocated blocks and in some cases the numbers of
>> spares used/remaining.  It is worth tracking these and alerting if a
>> drive is running short on spares or experiences a high rate of new
>> reallocated blocks.
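
On NVMe those are easy to eyeball with nvme-cli (device name is an example):

  nvme smart-log /dev/nvme0 | egrep -i 'available_spare|percentage_used|media_errors'

For SATA SSDs the attribute names are vendor-specific, so check what your model
actually exposes with smartctl -A.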
>> 
>>> 
>>> 2. Not saying anything before a device crashes.
>>> 
>>> Yeah. But I run it anyway because you never know.
>> 
>> The overall health status?  Yeah that usually has limited utility. 
>> I've seen plenty of drives that are clearly throwing errors yet
>> report healthy.  Arguably a firmware bug.
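
The per-device logs are usually more honest than the one-line verdict:

  smartctl -H /dev/sdX            # overall PASSED/FAILED
  smartctl -l error /dev/sdX      # error log
  smartctl -l selftest /dev/sdX   # past self-test results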
>> 
>>> The reported error is too far abstracted from the actual failure
>>> and I cannot find anything about -22 as a SMART result code.
>> 
>> I'm pretty sure that's an errno number for launching the subprocess,
>> nothing to do with SMART itself.  I'd check dmesg and
>> /var/log/{messages,syslog} to see if anything was reported for that
>> drive.  If the drive is SAS/SATA and hung off an LSI HBA, also try
>> `storcli64 /c0 show termlog >/var/tmp/termlog.txt`
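
e.g. something along the lines of (sdc being the device from the report):

  dmesg -T | egrep -i 'sdc|blk_update_request|I/O error'
  journalctl -k --since=-24h | grep -i sdc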
>> 
>> 
>>> *n*x errno 22 is EINVAL, which seems unlikely, but it is possible
>>> that smartd got misconfigured.
>> 
>> Best configuration IMHO is to stop, disable, and mask the service.
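
i.e.:

  systemctl disable --now smartd
  systemctl mask smartd

(on Debian/Ubuntu the unit may be named smartmontools rather than smartd)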
>> 
>>> 
>>> Run smartctl -t long /dev/sdc to launch an out-of-band long test.
>>> When it is done, use smartctl -l selftest /dev/sdc to report the
>>> results and see if anything is flagged.
>>> 
>>> On 8/21/25 09:10, Anthony D'Atri wrote:
>>>> 
>>>>> On Aug 21, 2025, at 4:07 AM, Miles Goodhew <c...@m0les.com>
>>>>> wrote:
>>>>> 
>>>>> Hi Robert,
>>>>>  I'm not an expert on the low-level details and "modern" Ceph,
>>>>> so I hope I don't lead you on any wild goose chases, but I
>>>>> might at least give some leads.
>>>>>  It seems odd that the metrics mention NVMe - I'm guessing
>>>>> that it's just a cross-product test that tries all tools on all
>>>>> devices.
>>>> Recent releases of smartctl pass through stats for NVMe devices
>>>> via the nvme-cli command "nvme".  Whether it invokes that for all
>>>> devices, ordering, etc. I don't know.
>>>> 
>>>> 
>>>>> SMART test failure is more of an issue. It's a pity the error
>>>>> message is so nondescript. Some things I can think of from
>>>>> simplest to most complicated are:
>>>>> * Are smartmontools installed on the drive host?
>>>> Does it happen with other drives on the same host?
>>>> 
>>>> If you have access through your chassis vendor, look for a
>>>> firmware update.
>>>> 
>>>>> * Does the monitoring UID have sudo access?
>>>>> * Does a manual "sudo smartctl -a /dev/sdc" give the same or
>>>>> similar result?
>>>>> * Is the drive managed by a hardware RAID controller or
>>>>> concentrator (like a Dell PERC or a USB adapter)?
>>>>> * (This is a stretch) Is there an OSD for the drive that's
>>>>> given the "NVME" class?
>>>>> 
>>>>> Hope that gives you something.
>>>>> 
>>>>> M0les.
>>>>> 
>>>>> 
>>>>> On Thu, 21 Aug 2025, at 17:15, Robert Sander wrote:
>>>>>> Hi,
>>>>>> 
>>>>>> On a new cluster with version 19.2.3 the device health
>>>>>> metrics only show a smartctl error:
>>>>>> 
>>>>>> {
>>>>>>     "20250821-000313": {
>>>>>>         "dev": "/dev/sdc",
>>>>>>         "error": "smartctl failed",
>>>>>>         "nvme_smart_health_information_add_log_error": "nvme returned an error: sudo: exit status: 1",
>>>>>>         "nvme_smart_health_information_add_log_error_code": -22,
>>>>>>         "nvme_vendor": "ata",
>>>>>>         "smartctl_error_code": -22,
>>>>>>         "smartctl_output": "smartctl returned an error (1): stderr:\nsudo: exit status: 1\nstdout:\n"
>>>>>>     }
>>>>>> }
>>>>>> 
>>>>>> The device in question (like all the other in the cluster) is
>>>>>> a Samsung MZ7L37T6 SATA SSD.
>>>>>> 
>>>>>> What is happening here?
>>>>>> 
>>>>>> Regards
>>>>>> -- 
>>>>>> Robert Sander
>>>>>> Linux Consultant
>>>>>> 
>>>>>> Heinlein Consulting GmbH
>>>>>> Schwedter Str. 8/9b, 10119 Berlin
>>>>>> 
>>>>>> https://www.heinlein-support.de
>>>>>> 
>>>>>> Tel: +49 30 405051 - 0
>>>>>> Fax: +49 30 405051 - 19
>>>>>> 
>>>>>> Amtsgericht Berlin-Charlottenburg - HRB 220009 B
>>>>>> Geschäftsführer: Peer Heinlein - Sitz: Berlin
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
