Re: [ceph-users] Random Health_warn
That sounds about right; I do see blocked requests sometimes when the
cluster is under really heavy load. Looking at some examples, I think the
"summary" field should list the issues:

    "summary": [],
    "overall_status": "HEALTH_OK",

I'll try logging that too.

Scott

On Thu, Feb 23, 2017 at 3:00 PM David Turner <david.tur...@storagecraft.com> wrote:
> There are multiple approaches to give you more information about the
> health state. The CLI has these two options:
>
>     ceph health detail
>     ceph status
>
> I also like using ceph-dash ( https://github.com/Crapworks/ceph-dash ).
> It has an associated Nagios check that scrapes the ceph-dash page.
>
> I personally run `watch ceph status` when I'm monitoring the cluster
> closely. It will show you things like blocked requests, OSDs flapping,
> mon clock skew, or whatever problem is causing the HEALTH_WARN state.
> The most likely cause for a HEALTH_WARN that comes and goes is blocked
> requests. Those can be caused by any number of things that you would
> need to diagnose further, if that is what is triggering the warning.
>
> David Turner | Cloud Operations Engineer | StorageCraft Technology
> Corporation <https://storagecraft.com>
>
> From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of John
> Spray [jsp...@redhat.com]
> Sent: Thursday, February 23, 2017 3:47 PM
> To: Scottix
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Random Health_warn
>
> [snip]
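Scott's plan of logging the "summary" field alongside overall_status could be sketched like this. This is a minimal sketch only: it assumes the Jewel-era (10.2.x) `ceph status --format json` layout quoted above, where "summary" is a list of {"severity", "summary"} objects next to "overall_status" inside "health", and `health_line` is a made-up helper name:

```python
def health_line(status_json):
    """Build a one-line health summary from the parsed output of
    `ceph status --format json`.

    Assumes the Jewel-era layout where status_json["health"] holds both
    "overall_status" and a "summary" list of {"severity", "summary"}
    entries describing the active warnings.
    """
    health = status_json.get("health", {})
    overall = health.get("overall_status", "UNKNOWN")
    issues = "; ".join(item.get("summary", "")
                       for item in health.get("summary", []))
    # When healthy, "summary" is empty and the bare status is enough.
    return "%s %s" % (overall, issues) if issues else overall
```

Logging this once a minute instead of the bare status would turn the anonymous HEALTH_WARN entries in the monitoring log into lines that name the triggering check.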
Re: [ceph-users] Random Health_warn
There are multiple approaches to give you more information about the
health state. The CLI has these two options:

    ceph health detail
    ceph status

I also like using ceph-dash ( https://github.com/Crapworks/ceph-dash ).
It has an associated Nagios check that scrapes the ceph-dash page.

I personally run `watch ceph status` when I'm monitoring the cluster
closely. It will show you things like blocked requests, OSDs flapping,
mon clock skew, or whatever problem is causing the HEALTH_WARN state.
The most likely cause for a HEALTH_WARN that comes and goes is blocked
requests. Those can be caused by any number of things that you would
need to diagnose further, if that is what is triggering the warning.

David Turner | Cloud Operations Engineer | StorageCraft Technology
Corporation <https://storagecraft.com>
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2760 | Mobile: 385.224.2943

If you are not the intended recipient of this message or received it
erroneously, please notify the sender and delete it, together with any
attachments, and be advised that any dissemination or copying of this
message is prohibited.

From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of John
Spray [jsp...@redhat.com]
Sent: Thursday, February 23, 2017 3:47 PM
To: Scottix
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Random Health_warn

[snip]
Re: [ceph-users] Random Health_warn
On Thu, Feb 23, 2017 at 10:40:31PM +, Scottix wrote:
> Ya, the ceph-mon.$ID.log.
>
> I was running ceph -w when one of them occurred too, and it never
> output anything.
>
> Here is a snippet for the 5:11 AM occurrence.

Yep, I don't see anything in there that should have triggered
HEALTH_WARN. All I can suggest is dumping the JSON health blob when it
occurs again, and seeing if anything stands out in it.

--
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation Trustee & Treasurer
E-Mail   : robb...@gentoo.org
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136
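Robin's suggestion of dumping the JSON health blob when the warning occurs again could look roughly like this. A sketch only: `capture_if_warn` and `fetch_status` are made-up names, and the fetcher is passed in as a callable so the snippet doesn't depend on a live cluster; in production it might wrap `subprocess.check_output(["ceph", "status", "--format", "json"])`:

```python
import json
import os
import time


def capture_if_warn(fetch_status, outdir):
    """Poll once; if overall_status is not HEALTH_OK, dump the entire
    JSON blob to a timestamped file for later inspection.

    fetch_status is a callable returning the parsed
    `ceph status --format json` dict. Returns the path of the file
    written, or None when the cluster reported HEALTH_OK.
    """
    blob = fetch_status()
    overall = blob.get("health", {}).get("overall_status", "UNKNOWN")
    if overall == "HEALTH_OK":
        return None
    # Keep the whole blob, not just the status, so the check that
    # fired can be read out of it after the fact.
    path = os.path.join(outdir, "health-%d.json" % int(time.time()))
    with open(path, "w") as f:
        json.dump(blob, f, indent=2)
    return path
```

Run from cron, or from a loop with `time.sleep(60)`, this leaves behind one file per unhealthy poll instead of a bare HEALTH_WARN line.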
Re: [ceph-users] Random Health_warn
On Thu, Feb 23, 2017 at 9:49 PM, Scottix wrote:
> ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)
>
> We are seeing some weird behavior and are not sure how to diagnose what
> could be going on. We started monitoring the overall_status from the
> JSON query, and every once in a while we get a HEALTH_WARN for a minute
> or two.
>
> Monitoring logs:
> 02/23/2017 07:25:54 AM HEALTH_OK
> 02/23/2017 07:24:54 AM HEALTH_WARN
> 02/23/2017 07:23:55 AM HEALTH_OK
> 02/23/2017 07:22:54 AM HEALTH_OK
> ...
> 02/23/2017 05:13:55 AM HEALTH_OK
> 02/23/2017 05:12:54 AM HEALTH_WARN
> 02/23/2017 05:11:54 AM HEALTH_WARN
> 02/23/2017 05:10:54 AM HEALTH_OK
> 02/23/2017 05:09:54 AM HEALTH_OK
>
> When I check the mon leader logs, there is no indication of an error or
> issue that could be occurring. Is there a way to find what is causing
> the HEALTH_WARN?

Possibly not without grabbing more than just the overall status at the
same time as you're grabbing the OK/WARN status.

Internally, the OK/WARN/ERROR health state is generated on demand by
applying a bunch of checks to the state of the system when the user runs
the health command -- the system doesn't know it's in a warning state
until it's asked. Often you will see a corresponding log message, but
not necessarily.

John

> Best,
> Scott
Re: [ceph-users] Random Health_warn
Ya, the ceph-mon.$ID.log.

I was running ceph -w when one of them occurred too, and it never output
anything.

Here is a snippet for the 5:11 AM occurrence.

On Thu, Feb 23, 2017 at 1:56 PM Robin H. Johnson wrote:
> By leader logs, do you mean the cluster log (mon_cluster_log_file), or
> the mon log (log_file)? E.g. /var/log/ceph/ceph.log vs
> /var/log/ceph/ceph-mon.$ID.log.
>
> Could you post the log entries for a time period between two HEALTH_OK
> states with a HEALTH_WARN in the middle?
>
> The reason for WARN _should_ be included on the logged status line.
>
> Alternatively, you should be able to just log the output of 'ceph -w'
> for a while, and find the WARN status as well.
> [snip]

2017-02-23 05:10:54.139358 7f5c17894700  0 mon.CephMon200@0(leader) e7 handle_command mon_command({"prefix": "status", "format": "json"} v 0) v1
2017-02-23 05:10:54.139549 7f5c17894700  0 log_channel(audit) log [DBG] : from='client.? 10.10.1.30:0/1031767' entity='client.admin' cmd=[{"prefix": "status", "format": "json"}]: dispatch
2017-02-23 05:10:54.535319 7f5c1a25c700  0 log_channel(cluster) log [INF] : pgmap v77496604: 5120 pgs: 2 active+clean+scrubbing, 5111 active+clean, 7 active+clean+scrubbing+deep; 58071 GB data, 114 TB used, 113 TB / 227 TB avail; 16681 kB/s rd, 11886 kB/s wr, 705 op/s
2017-02-23 05:10:55.600104 7f5c1a25c700  0 log_channel(cluster) log [INF] : pgmap v77496605: 5120 pgs: 2 active+clean+scrubbing, 5111 active+clean, 7 active+clean+scrubbing+deep; 58071 GB data, 114 TB used, 113 TB / 227 TB avail; 14716 kB/s rd, 6627 kB/s wr, 408 op/s
2017-02-23 05:10:56.170435 7f5c17894700  0 mon.CephMon200@0(leader) e7 handle_command mon_command({"prefix": "status", "format": "json"} v 0) v1
2017-02-23 05:10:56.170502 7f5c17894700  0 log_channel(audit) log [DBG] : from='client.? 10.10.1.30:0/1031899' entity='client.admin' cmd=[{"prefix": "status", "format": "json"}]: dispatch
2017-02-23 05:10:56.642040 7f5c1a25c700  0 log_channel(cluster) log [INF] : pgmap v77496606: 5120 pgs: 2 active+clean+scrubbing, 5111 active+clean, 7 active+clean+scrubbing+deep; 58071 GB data, 114 TB used, 113 TB / 227 TB avail; 14617 kB/s rd, 6580 kB/s wr, 537 op/s
2017-02-23 05:10:57.667496 7f5c1a25c700  0 log_channel(cluster) log [INF] : pgmap v77496607: 5120 pgs: 2 active+clean+scrubbing, 5110 active+clean, 8 active+clean+scrubbing+deep; 58071 GB data, 114 TB used, 113 TB / 227 TB avail; 8862 kB/s rd, 7126 kB/s wr, 552 op/s
2017-02-23 05:10:58.736114 7f5c1a25c700  0 log_channel(cluster) log [INF] : pgmap v77496608: 5120 pgs: 2 active+clean+scrubbing, 5110 active+clean, 8 active+clean+scrubbing+deep; 58071 GB data, 114 TB used, 113 TB / 227 TB avail; 14126 kB/s rd, 11254 kB/s wr, 974 op/s
2017-02-23 05:10:59.451884 7f5c17894700  0 mon.CephMon200@0(leader) e7 handle_command mon_command({"prefix": "status", "format": "json"} v 0) v1
2017-02-23 05:10:59.451903 7f5c17894700  0 log_channel(audit) log [DBG] : from='client.? 10.10.1.30:0/1031932' entity='client.admin' cmd=[{"prefix": "status", "format": "json"}]: dispatch
2017-02-23 05:10:59.812909 7f5c1a25c700  0 log_channel(cluster) log [INF] : pgmap v77496609: 5120 pgs: 2 active+clean+scrubbing, 5110 active+clean, 8 active+clean+scrubbing+deep; 58071 GB data, 114 TB used, 113 TB / 227 TB avail; 11238 kB/s rd, 8236 kB/s wr, 785 op/s
2017-02-23 05:11:00.829329 7f5c1a25c700  0 log_channel(cluster) log [INF] : pgmap v77496610: 5120 pgs: 2 active+clean+scrubbing, 5110 active+clean, 8 active+clean+scrubbing+deep; 58071 GB data, 114 TB used, 113 TB / 227 TB avail; 6193 kB/s rd, 7345 kB/s wr, 186 op/s
2017-02-23 05:11:01.850120 7f5c1a25c700  0 log_channel(cluster) log [INF] : pgmap v77496611: 5120
Re: [ceph-users] Random Health_warn
On Thu, Feb 23, 2017 at 09:49:21PM +, Scottix wrote:
> [snip]
> When I check the mon leader logs, there is no indication of an error or
> issue that could be occurring. Is there a way to find what is causing
> the HEALTH_WARN?

By leader logs, do you mean the cluster log (mon_cluster_log_file), or
the mon log (log_file)? E.g. /var/log/ceph/ceph.log vs
/var/log/ceph/ceph-mon.$ID.log.

Could you post the log entries for a time period between two HEALTH_OK
states with a HEALTH_WARN in the middle?

The reason for WARN _should_ be included on the logged status line.

Alternatively, you should be able to just log the output of 'ceph -w'
for a while, and find the WARN status as well.

--
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation Trustee & Treasurer
E-Mail   : robb...@gentoo.org
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136
[ceph-users] Random Health_warn
ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)

We are seeing some weird behavior and are not sure how to diagnose what
could be going on. We started monitoring the overall_status from the JSON
query, and every once in a while we get a HEALTH_WARN for a minute or two.

Monitoring logs:

02/23/2017 07:25:54 AM HEALTH_OK
02/23/2017 07:24:54 AM HEALTH_WARN
02/23/2017 07:23:55 AM HEALTH_OK
02/23/2017 07:22:54 AM HEALTH_OK
...
02/23/2017 05:13:55 AM HEALTH_OK
02/23/2017 05:12:54 AM HEALTH_WARN
02/23/2017 05:11:54 AM HEALTH_WARN
02/23/2017 05:10:54 AM HEALTH_OK
02/23/2017 05:09:54 AM HEALTH_OK

When I check the mon leader logs, there is no indication of an error or
issue that could be occurring. Is there a way to find what is causing the
HEALTH_WARN?

Best,
Scott

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
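A small helper can turn a monitoring log like the one above into explicit WARN intervals, so you know exactly which window of the mon logs to pull up. A minimal sketch: `warn_windows` is a made-up name, and it assumes lines in the exact "MM/DD/YYYY HH:MM:SS AM STATUS" format shown above, fed in chronological (oldest-first) order:

```python
from datetime import datetime


def warn_windows(lines):
    """Scan lines like '02/23/2017 05:12:54 AM HEALTH_WARN' (in
    chronological order) and return a list of (first_warn, last_warn)
    datetime pairs, one per contiguous run of HEALTH_WARN samples."""
    windows, start, last = [], None, None
    for line in lines:
        stamp, status = line.rsplit(" ", 1)
        ts = datetime.strptime(stamp, "%m/%d/%Y %I:%M:%S %p")
        if status == "HEALTH_WARN":
            if start is None:
                start = ts       # a new WARN run begins
            last = ts
        elif start is not None:
            windows.append((start, last))  # run ended on a non-WARN sample
            start = last = None
    if start is not None:
        windows.append((start, last))      # log ended mid-run
    return windows
```

For the 5 AM samples above this yields a single window from 05:11:54 to 05:12:54, which is the span to grep for in ceph-mon.$ID.log or the cluster log.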