It's important to note that we do not suggest using the SMART "OK" indicator as 
proof that a drive is healthy. We monitor the correctable/uncorrectable error 
counts instead, as you can see a dramatic rise in them when a drive starts to 
fail. 'OK' will still be reported for SMART health long after the drive is 
throwing many uncorrectable errors and needs replacement. You have to look at 
the actual counters themselves.
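
For example, something along these lines (exact attribute names vary by vendor 
and interface, so treat it as a rough sketch rather than a definitive check):

 smartctl -A /dev/sdX | grep -Ei 'reallocated|pending|uncorrect|crc'

For SAS drives, "smartctl -a" prints a separate read/write/verify error-counter 
log instead of the attribute table.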

That said, you will generally see these uncorrectable errors in the kernel 
output from dmesg, as well.
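
For example (the exact log wording differs between drivers and kernels, so 
adjust the pattern as needed):

 dmesg -T | grep -Ei 'i/o error|medium error|uncorrect'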

On Mon, Jan 9, 2023, at 16:38, Erik Lindahl wrote:
> Hi,
> 
> We too kept seeing this until a few months ago in a cluster with ~400 HDDs, 
> while all the drives' SMART statistics were always A-OK. Since we use erasure 
> coding, each PG involves up to 10 HDDs.
> 
> It took us a while to realize we shouldn't expect scrub errors on healthy 
> drives, but eventually we decided to track it down and found documentation 
> suggesting the use of
> 
>  rados list-inconsistent-obj <PG>  --format=json-pretty
> 
> ... before you repair the PG. If you look into that (long) output, you are 
> likely going to find a "read_error" for a specific OSD. Then we started to 
> make a note of the HDD that saw the error.
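> 
> For illustration, a quick filter along these lines (the exact JSON layout can 
> differ a bit between releases, so treat it as a sketch) pulls out just the 
> shards that reported an error, together with their OSD id:
> 
>  rados list-inconsistent-obj <PG> --format=json | \
>    jq '.inconsistents[].shards[] | select(.errors != []) | {osd, errors}'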
> 
> This helped us identify two HDDs that had multiple read errors within a few 
> weeks, even though their SMART data was still perfectly fine. Now that 
> *might* just be bad luck, but we have enough drives that we don't care, so we 
> just replaced them, and since then I've only had a single drive report an 
> error.
> 
> One conclusion (in our case) is that these could be drives that would likely 
> have failed sooner or later, even though they hadn't yet reached a threshold 
> for SMART to worry about; the alternative is that they are drives that simply 
> have more frequent read errors while technically staying within the allowed 
> variation. Assuming you have configured your cluster with reasonable 
> redundancy, you shouldn't run any risk of data loss, but we figured it's 
> worth replacing a few outlier drives to sleep better.
> 
> Cheers,
> 
> Erik
> 
> --
> Erik Lindahl <erik.lind...@gmail.com>
> On 9 Jan 2023 at 23:06 +0100, David Orman <orma...@corenode.com>, wrote:
> > "dmesg" on all the linux hosts and look for signs of failing drives. Look 
> > at smart data, your HBAs/disk controllers, OOB management logs, and so 
> > forth. If you're seeing scrub errors, it's probably a bad disk backing an 
> > OSD or OSDs.
> >
> > Is there a common OSD in the PGs you've run the repairs on?
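> >
> > For example, something like this (just a sketch) lists the inconsistent PGs 
> > together with their acting sets, which makes a repeat-offender OSD easy to 
> > spot:
> >
> >  ceph health detail | grep inconsistent
> >
> > For a single PG, "ceph pg map <pgid>" prints its up and acting sets.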
> >
> > On Mon, Jan 9, 2023, at 03:37, Kuhring, Mathias wrote:
> > > Hey all,
> > >
> > > I'd like to pick up on this topic, since we have also been seeing regular
> > > scrub errors recently.
> > > Roughly one per week for around six weeks now.
> > > It's always a different PG, and the repair command always helps after a
> > > while.
> > > But the regular recurrence seems a bit unsettling.
> > > How do we best troubleshoot this?
> > >
> > > We are currently on ceph version 17.2.1
> > > (ec95624474b1871a821a912b8c3af68f8f8e7aa1) quincy (stable)
> > >
> > > Best Wishes,
> > > Mathias
> > >
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
