On Fri, Jan 4, 2019 at 1:19 PM Jan Kasprzak <k...@fi.muni.cz> wrote:
>
> Gregory Farnum wrote:
> : On Wed, Jan 2, 2019 at 5:12 AM Jan Kasprzak <k...@fi.muni.cz> wrote:
> :
> : > Thomas Byrne - UKRI STFC wrote:
> : > : I recently spent some time looking at this; I believe the 'summary' and
> : > : 'overall_status' sections are now deprecated. The 'status' and 'checks'
> : > : fields are the ones to use now.
> : >
> : >         OK, thanks.
> : >
> : > : The 'status' field gives you the OK/WARN/ERR, but returning the most
> : > : severe error condition from the 'checks' section is less trivial. AFAIK
> : > : all health_warn states are treated as equally severe, and the same goes
> : > : for health_err. We ended up formatting our single-line human-readable
> : > : output as something like:
> : > :
> : > : "HEALTH_ERR: 1 inconsistent pg, HEALTH_ERR: 1 scrub error, HEALTH_WARN:
> : > 20 large omap objects"
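
For reference, a rough sketch of how such a one-liner could be built from
the JSON output, assuming the Luminous-era layout where each entry in
'checks' carries a "severity" and a "summary" message (field names may
differ on other releases):

#!/usr/bin/env python
# Rough sketch: build a one-line health summary from `ceph health -f json`.
# Assumes a top-level "status" plus a "checks" map whose entries carry
# "severity" and "summary": {"message": ...} (Luminous-era layout).
import json
import subprocess

health = json.loads(subprocess.check_output(['ceph', 'health', '-f', 'json']))

parts = []
for name, check in sorted(health.get('checks', {}).items()):
    severity = check.get('severity', 'HEALTH_WARN')
    message = check.get('summary', {}).get('message', name)
    parts.append('%s: %s' % (severity, message))

print('%s - %s' % (health.get('status', 'UNKNOWN'), ', '.join(parts) or 'no checks'))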
> : >
> : >         Speaking of scrub errors:
> : >
> : >         In previous versions of Ceph, I was able to determine which PGs had
> : > scrub errors, and then a cron.hourly script ran "ceph pg repair" for them,
> : > provided that they were not already being scrubbed. In Luminous, the bad PG
> : > is not visible in "ceph --status" anywhere. Should I use something like
> : > "ceph health detail -f json-pretty" instead?
> : >
> : >         Also, is it possible to configure Ceph to attempt repairing
> : > the bad PGs itself, as soon as the scrub fails? I run most of my OSDs on
> : > top
> : > of a bunch of old spinning disks, and a scrub error almost always means
> : > that there is a bad sector somewhere, which can easily be fixed by
> : > rewriting the lost data using "ceph pg repair".
> : >
> :
> : It is possible. It's a lot safer than it used to be, but is still NOT
> : RECOMMENDED for replicated pools.
> :
> : But if you are very sure, you can configure it with the options
> : osd_scrub_auto_repair (default: false) and osd_scrub_auto_repair_num_errors
> : (default: 5; auto-repair is skipped if scrub detects more errors than
> : that value).
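
For the record, if someone is very sure and does want it on, it comes down
to two settings in the [osd] section of ceph.conf (the values below are
only an illustration):

[osd]
# only enable this if you accept the risk discussed in this thread
osd scrub auto repair = true
# skip auto-repair when scrub finds more than this many errors
osd scrub auto repair num errors = 5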
>
>         OK, thanks. I just want to say that I am NOT very sure,
> but this is about the only way I am aware of to handle a scrub error.
> I have mail notifications set up in smartd.conf, and so far the scrub
> errors seem to correlate with new reallocated or pending sectors.
>
>         What are the drawbacks of running "ceph pg repair" as soon
> as the cluster enters the HEALTH_ERR state with a scrub error?

I think there are still some rare cases where it's possible that Ceph
chooses the wrong copy as the authoritative one. Those windows keep
getting smaller, though, and if you're just running repair all the
time anyway, then having the system do it automatically obviously isn't
any worse. :)
-Greg

>
>         Thanks for explanation,
>
> -Yenya
>
> --
> | Jan "Yenya" Kasprzak <kas at {fi.muni.cz - work | yenya.net - private}> |
> | http://www.fi.muni.cz/~kas/                         GPG: 4096R/A45477D5 |
>  This is the world we live in: the way to deal with computers is to google
>  the symptoms, and hope that you don't have to watch a video. --P. Zaitcev