Re: [ceph-users] Random Health_warn

2017-02-23 Thread Scottix
That sounds about right; I do see blocked requests sometimes when it is
under really heavy load.

Looking at some examples, I think the summary field should list the issues:
"summary": [],
"overall_status": "HEALTH_OK",

I'll try logging that too.
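
Probably something along these lines (a rough, untested sketch; assumes jq is
installed and that the field names match what our 10.2.5 mons emit):

    while sleep 60; do
        # Log a timestamp, the overall status, and the summary list together.
        printf '%s %s\n' "$(date '+%m/%d/%Y %r')" \
            "$(ceph status --format json | \
               jq -r '[.health.overall_status, (.health.summary | tostring)] | @tsv')"
    done >> health.log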

Scott

On Thu, Feb 23, 2017 at 3:00 PM David Turner <david.tur...@storagecraft.com>
wrote:

> There are multiple approaches to getting more information about the health
> state.  The CLI has these two options:
> ceph health detail
> ceph status
>
> I also like using ceph-dash.  ( https://github.com/Crapworks/ceph-dash )
>  It has an associated Nagios check to scrape the ceph-dash page.
>
> I personally do `watch ceph status` when I'm monitoring the cluster
> closely.  It will show you things like blocked requests, OSDs flapping, mon
> clock skew, or whatever problem is causing the health_warn state.  The most
> likely cause of intermittent health_warn is blocked requests; those can be
> caused by any number of things that you would need to diagnose further.
>
> From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of John
> Spray [jsp...@redhat.com]
> Sent: Thursday, February 23, 2017 3:47 PM
> To: Scottix
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Random Health_warn
>
>
> On Thu, Feb 23, 2017 at 9:49 PM, Scottix <scot...@gmail.com> wrote:
> > ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)
> >
> > We are seeing some weird behavior and are not sure how to diagnose what
> > could be going on. We started monitoring the overall_status from the JSON
> > query and every once in a while we would get a HEALTH_WARN for a minute
> > or two.
> >
> > Monitoring logs.
> > 02/23/2017 07:25:54 AM HEALTH_OK
> > 02/23/2017 07:24:54 AM HEALTH_WARN
> > 02/23/2017 07:23:55 AM HEALTH_OK
> > 02/23/2017 07:22:54 AM HEALTH_OK
> > ...
> > 02/23/2017 05:13:55 AM HEALTH_OK
> > 02/23/2017 05:12:54 AM HEALTH_WARN
> > 02/23/2017 05:11:54 AM HEALTH_WARN
> > 02/23/2017 05:10:54 AM HEALTH_OK
> > 02/23/2017 05:09:54 AM HEALTH_OK
> >
> > When I check the mon leader logs, there is no indication of any error or
> > issue that could be occurring. Is there a way to find what is causing the
> > HEALTH_WARN?
>
> Possibly not, without grabbing more detail at the same time as you're
> grabbing the OK/WARN status.
>
> Internally, the OK/WARN/ERROR health state is generated on-demand by
> applying a bunch of checks to the state of the system when the user
> runs the health command -- the system doesn't know it's in a warning
> state until it's asked.  Often you will see a corresponding log
> message, but not necessarily.
>
> John
>
> > Best,
> > Scott
> >


Re: [ceph-users] Random Health_warn

2017-02-23 Thread David Turner
There are multiple approaches to getting more information about the health
state.  The CLI has these two options:
ceph health detail
ceph status

I also like using ceph-dash.  ( https://github.com/Crapworks/ceph-dash )  It
has an associated Nagios check to scrape the ceph-dash page.

I personally do `watch ceph status` when I'm monitoring the cluster closely.
It will show you things like blocked requests, OSDs flapping, mon clock skew,
or whatever problem is causing the health_warn state.  The most likely cause
of intermittent health_warn is blocked requests; those can be caused by any
number of things that you would need to diagnose further.
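
If you want to catch the culprit automatically, a rough sketch along these
lines (untested; adjust the interval and log path to taste) would capture the
detail at the moment the status flips:

    #!/bin/sh
    # Poll once a minute; whenever the cluster is not HEALTH_OK,
    # record a timestamp and the full health detail.
    while sleep 60; do
        case "$(ceph health)" in
            HEALTH_OK*) ;;
            *) { date; ceph health detail; } >> /var/log/ceph-health-detail.log ;;
        esac
    done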



From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of John Spray 
[jsp...@redhat.com]
Sent: Thursday, February 23, 2017 3:47 PM
To: Scottix
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Random Health_warn

On Thu, Feb 23, 2017 at 9:49 PM, Scottix <scot...@gmail.com> wrote:
> ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)
>
> We are seeing some weird behavior and are not sure how to diagnose what
> could be going on. We started monitoring the overall_status from the JSON
> query and every once in a while we would get a HEALTH_WARN for a minute or
> two.
>
> Monitoring logs.
> 02/23/2017 07:25:54 AM HEALTH_OK
> 02/23/2017 07:24:54 AM HEALTH_WARN
> 02/23/2017 07:23:55 AM HEALTH_OK
> 02/23/2017 07:22:54 AM HEALTH_OK
> ...
> 02/23/2017 05:13:55 AM HEALTH_OK
> 02/23/2017 05:12:54 AM HEALTH_WARN
> 02/23/2017 05:11:54 AM HEALTH_WARN
> 02/23/2017 05:10:54 AM HEALTH_OK
> 02/23/2017 05:09:54 AM HEALTH_OK
>
> When I check the mon leader logs, there is no indication of any error or
> issue that could be occurring. Is there a way to find what is causing the
> HEALTH_WARN?

Possibly not, without grabbing more detail at the same time as you're
grabbing the OK/WARN status.

Internally, the OK/WARN/ERROR health state is generated on-demand by
applying a bunch of checks to the state of the system when the user
runs the health command -- the system doesn't know it's in a warning
state until it's asked.  Often you will see a corresponding log
message, but not necessarily.

John

> Best,
> Scott
>


Re: [ceph-users] Random Health_warn

2017-02-23 Thread Robin H. Johnson
On Thu, Feb 23, 2017 at 10:40:31PM +, Scottix wrote:
> Yeah, the ceph-mon.$ID.log.
> 
> I was running ceph -w when one of them occurred too and it never output
> anything.
> 
> Here is a snippet for the 5:11 AM occurrence.
Yep, I don't see anything in there that should have triggered
HEALTH_WARN.

All I can suggest is dumping the JSON health blob when it occurs again,
and seeing if anything stands out in it.
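
Something like this next to your existing poll would do it (a rough, untested
suggestion; it's a one-shot dump, so run it from the same cron/loop as the
status check):

    # Keep a timestamped copy of the full health blob; afterwards, diff a
    # WARN-era sample against an OK one.
    ceph health --format json-pretty > "health.$(date +%Y%m%dT%H%M%S).json"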



Re: [ceph-users] Random Health_warn

2017-02-23 Thread John Spray
On Thu, Feb 23, 2017 at 9:49 PM, Scottix  wrote:
> ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)
>
> We are seeing some weird behavior and are not sure how to diagnose what
> could be going on. We started monitoring the overall_status from the JSON
> query and every once in a while we would get a HEALTH_WARN for a minute or
> two.
>
> Monitoring logs.
> 02/23/2017 07:25:54 AM HEALTH_OK
> 02/23/2017 07:24:54 AM HEALTH_WARN
> 02/23/2017 07:23:55 AM HEALTH_OK
> 02/23/2017 07:22:54 AM HEALTH_OK
> ...
> 02/23/2017 05:13:55 AM HEALTH_OK
> 02/23/2017 05:12:54 AM HEALTH_WARN
> 02/23/2017 05:11:54 AM HEALTH_WARN
> 02/23/2017 05:10:54 AM HEALTH_OK
> 02/23/2017 05:09:54 AM HEALTH_OK
>
> When I check the mon leader logs, there is no indication of any error or
> issue that could be occurring. Is there a way to find what is causing the
> HEALTH_WARN?

Possibly not, without grabbing more detail at the same time as you're
grabbing the OK/WARN status.

Internally, the OK/WARN/ERROR health state is generated on-demand by
applying a bunch of checks to the state of the system when the user
runs the health command -- the system doesn't know it's in a warning
state until it's asked.  Often you will see a corresponding log
message, but not necessarily.
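
(A rough suggestion: have whatever does the polling ask for the detail at the
same moment, e.g.

    ceph health detail

since that applies the same checks at the instant you ask, and the output will
name whichever condition produced the WARN.)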

John

> Best,
> Scott
>


Re: [ceph-users] Random Health_warn

2017-02-23 Thread Scottix
Yeah, the ceph-mon.$ID.log.

I was running ceph -w when one of them occurred too and it never output
anything.

Here is a snippet for the 5:11 AM occurrence.

On Thu, Feb 23, 2017 at 1:56 PM Robin H. Johnson  wrote:

> On Thu, Feb 23, 2017 at 09:49:21PM +, Scottix wrote:
> > ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)
> >
> > We are seeing some weird behavior and are not sure how to diagnose what
> > could be going on. We started monitoring the overall_status from the JSON
> > query and every once in a while we would get a HEALTH_WARN for a minute
> > or two.
> >
> > Monitoring logs.
> > 02/23/2017 07:25:54 AM HEALTH_OK
> > 02/23/2017 07:24:54 AM HEALTH_WARN
> > 02/23/2017 07:23:55 AM HEALTH_OK
> > 02/23/2017 07:22:54 AM HEALTH_OK
> > ...
> > 02/23/2017 05:13:55 AM HEALTH_OK
> > 02/23/2017 05:12:54 AM HEALTH_WARN
> > 02/23/2017 05:11:54 AM HEALTH_WARN
> > 02/23/2017 05:10:54 AM HEALTH_OK
> > 02/23/2017 05:09:54 AM HEALTH_OK
> >
> > When I check the mon leader logs, there is no indication of any error or
> > issue that could be occurring. Is there a way to find what is causing the
> > HEALTH_WARN?
> By leader logs, do you mean the cluster log (mon_cluster_log_file), or
> the mon log (log_file)? E.g. /var/log/ceph/ceph.log vs
> /var/log/ceph/ceph-mon.$ID.log.
>
> Could you post the log entries for a time period between two HEALTH_OK
> states with a HEALTH_WARN in the middle?
>
> The reason for WARN _should_ be included on the logged status line.
>
> Alternatively, you should be able to just log the output of 'ceph -w'
> for a while, and find the WARN status as well.
>
2017-02-23 05:10:54.139358 7f5c17894700  0 mon.CephMon200@0(leader) e7 handle_command mon_command({"prefix": "status", "format": "json"} v 0) v1
2017-02-23 05:10:54.139549 7f5c17894700  0 log_channel(audit) log [DBG] : from='client.? 10.10.1.30:0/1031767' entity='client.admin' cmd=[{"prefix": "status", "format": "json"}]: dispatch
2017-02-23 05:10:54.535319 7f5c1a25c700  0 log_channel(cluster) log [INF] : pgmap v77496604: 5120 pgs: 2 active+clean+scrubbing, 5111 active+clean, 7 active+clean+scrubbing+deep; 58071 GB data, 114 TB used, 113 TB / 227 TB avail; 16681 kB/s rd, 11886 kB/s wr, 705 op/s
2017-02-23 05:10:55.600104 7f5c1a25c700  0 log_channel(cluster) log [INF] : pgmap v77496605: 5120 pgs: 2 active+clean+scrubbing, 5111 active+clean, 7 active+clean+scrubbing+deep; 58071 GB data, 114 TB used, 113 TB / 227 TB avail; 14716 kB/s rd, 6627 kB/s wr, 408 op/s
2017-02-23 05:10:56.170435 7f5c17894700  0 mon.CephMon200@0(leader) e7 handle_command mon_command({"prefix": "status", "format": "json"} v 0) v1
2017-02-23 05:10:56.170502 7f5c17894700  0 log_channel(audit) log [DBG] : from='client.? 10.10.1.30:0/1031899' entity='client.admin' cmd=[{"prefix": "status", "format": "json"}]: dispatch
2017-02-23 05:10:56.642040 7f5c1a25c700  0 log_channel(cluster) log [INF] : pgmap v77496606: 5120 pgs: 2 active+clean+scrubbing, 5111 active+clean, 7 active+clean+scrubbing+deep; 58071 GB data, 114 TB used, 113 TB / 227 TB avail; 14617 kB/s rd, 6580 kB/s wr, 537 op/s
2017-02-23 05:10:57.667496 7f5c1a25c700  0 log_channel(cluster) log [INF] : pgmap v77496607: 5120 pgs: 2 active+clean+scrubbing, 5110 active+clean, 8 active+clean+scrubbing+deep; 58071 GB data, 114 TB used, 113 TB / 227 TB avail; 8862 kB/s rd, 7126 kB/s wr, 552 op/s
2017-02-23 05:10:58.736114 7f5c1a25c700  0 log_channel(cluster) log [INF] : pgmap v77496608: 5120 pgs: 2 active+clean+scrubbing, 5110 active+clean, 8 active+clean+scrubbing+deep; 58071 GB data, 114 TB used, 113 TB / 227 TB avail; 14126 kB/s rd, 11254 kB/s wr, 974 op/s
2017-02-23 05:10:59.451884 7f5c17894700  0 mon.CephMon200@0(leader) e7 handle_command mon_command({"prefix": "status", "format": "json"} v 0) v1
2017-02-23 05:10:59.451903 7f5c17894700  0 log_channel(audit) log [DBG] : from='client.? 10.10.1.30:0/1031932' entity='client.admin' cmd=[{"prefix": "status", "format": "json"}]: dispatch
2017-02-23 05:10:59.812909 7f5c1a25c700  0 log_channel(cluster) log [INF] : pgmap v77496609: 5120 pgs: 2 active+clean+scrubbing, 5110 active+clean, 8 active+clean+scrubbing+deep; 58071 GB data, 114 TB used, 113 TB / 227 TB avail; 11238 kB/s rd, 8236 kB/s wr, 785 op/s
2017-02-23 05:11:00.829329 7f5c1a25c700  0 log_channel(cluster) log [INF] : pgmap v77496610: 5120 pgs: 2 active+clean+scrubbing, 5110 active+clean, 8 active+clean+scrubbing+deep; 58071 GB data, 114 TB used, 113 TB / 227 TB avail; 6193 kB/s rd, 7345 kB/s wr, 186 op/s
2017-02-23 05:11:01.850120 7f5c1a25c700  0 log_channel(cluster) log [INF] : pgmap v77496611: 5120 

Re: [ceph-users] Random Health_warn

2017-02-23 Thread Robin H. Johnson
On Thu, Feb 23, 2017 at 09:49:21PM +, Scottix wrote:
> ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)
> 
> We are seeing some weird behavior and are not sure how to diagnose what
> could be going on. We started monitoring the overall_status from the JSON
> query and every once in a while we would get a HEALTH_WARN for a minute or
> two.
> 
> Monitoring logs.
> 02/23/2017 07:25:54 AM HEALTH_OK
> 02/23/2017 07:24:54 AM HEALTH_WARN
> 02/23/2017 07:23:55 AM HEALTH_OK
> 02/23/2017 07:22:54 AM HEALTH_OK
> ...
> 02/23/2017 05:13:55 AM HEALTH_OK
> 02/23/2017 05:12:54 AM HEALTH_WARN
> 02/23/2017 05:11:54 AM HEALTH_WARN
> 02/23/2017 05:10:54 AM HEALTH_OK
> 02/23/2017 05:09:54 AM HEALTH_OK
> 
> When I check the mon leader logs, there is no indication of any error or
> issue that could be occurring. Is there a way to find what is causing the
> HEALTH_WARN?
By leader logs, do you mean the cluster log (mon_cluster_log_file), or
the mon log (log_file)? E.g. /var/log/ceph/ceph.log vs
/var/log/ceph/ceph-mon.$ID.log.

Could you post the log entries for a time period between two HEALTH_OK
states with a HEALTH_WARN in the middle?

The reason for WARN _should_ be included on the logged status line.

Alternatively, you should be able to just log the output of 'ceph -w'
for a while, and find the WARN status as well.
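
For instance (untested), something as simple as

    ceph -w >> /var/log/ceph-w.log 2>&1 &

left running in the background; afterwards, search the file around the WARN
timestamps.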



[ceph-users] Random Health_warn

2017-02-23 Thread Scottix
ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)

We are seeing some weird behavior and are not sure how to diagnose what could
be going on. We started monitoring the overall_status from the JSON query and
every once in a while we would get a HEALTH_WARN for a minute or two.
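
For reference, the check is essentially the following, run once a minute
(paraphrased from our monitoring script, with jq standing in for our actual
JSON parsing):

    ceph status --format json | jq -r '.health.overall_status'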

Monitoring logs.
02/23/2017 07:25:54 AM HEALTH_OK
02/23/2017 07:24:54 AM HEALTH_WARN
02/23/2017 07:23:55 AM HEALTH_OK
02/23/2017 07:22:54 AM HEALTH_OK
...
02/23/2017 05:13:55 AM HEALTH_OK
02/23/2017 05:12:54 AM HEALTH_WARN
02/23/2017 05:11:54 AM HEALTH_WARN
02/23/2017 05:10:54 AM HEALTH_OK
02/23/2017 05:09:54 AM HEALTH_OK

When I check the mon leader logs, there is no indication of any error or
issue that could be occurring. Is there a way to find what is causing the
HEALTH_WARN?

Best,
Scott