On 2018-03-02 07:54, Alex Gorbachev wrote:

> On Thu, Mar 1, 2018 at 10:57 PM, David Turner <drakonst...@gmail.com> wrote: 
> Blocked requests and slow requests are synonyms in Ceph; they are two
> names for the exact same thing.
> 
> On Thu, Mar 1, 2018, 10:21 PM Alex Gorbachev <a...@iss-integration.com> 
> wrote: 
> On Thu, Mar 1, 2018 at 2:47 PM, David Turner <drakonst...@gmail.com>
> wrote: `ceph health detail` should show you more information about the slow
> requests.  If the output is too much stuff, you can grep out for blocked
> or
> something.  It should tell you which OSDs are involved, how long they've
> been slow, etc.  The default is for them to show '> 32 sec' but that may
> very well be much longer and `ceph health detail` will show that. 
> Hi David,
> 
> Thank you for the reply.  Unfortunately, the health detail only shows
> blocked requests.  This seems to be related to a compression setting
> on the pool, nothing in OSD logs.
> 
> I replied to another compression thread.  This makes sense since
> compression is new, and in the past all such issues were reflected in
> OSD logs and related to either network or OSD hardware.
> 
> Regards,
> Alex
> 
> On Thu, Mar 1, 2018 at 2:23 PM Alex Gorbachev <a...@iss-integration.com>
> wrote: 
> Is there a switch to turn on the display of specific OSD issues?  Or
> does the below indicate a generic problem, e.g. network, rather than
> any specific OSD?
> 
> 2018-02-28 18:09:36.438300 7f6dead56700  0
> mon.roc-vm-sc3c234@0(leader).data_health(46) update_stats avail 56%
> total 15997 MB, used 6154 MB, avail 9008 MB
> 2018-02-28 18:09:41.477216 7f6dead56700  0 log_channel(cluster) log
> [WRN] : Health check failed: 73 slow requests are blocked > 32 sec
> (REQUEST_SLOW)
> 2018-02-28 18:09:47.552669 7f6dead56700  0 log_channel(cluster) log
> [WRN] : Health check update: 74 slow requests are blocked > 32 sec
> (REQUEST_SLOW)
> 2018-02-28 18:09:53.794882 7f6de8551700  0
> mon.roc-vm-sc3c234@0(leader) e1 handle_command mon_command({"prefix":
> "status", "format": "json"} v 0) v1
> 
> --

I was wrong: the pool compression setting does not matter; an
uncompressed pool also generates these slow-request messages.

The question is why there is no subsequent message identifying the
specific OSDs involved, as there was in Jewel and prior releases, e.g.
this example from Red Hat:

2015-08-24 13:18:10.024659 osd.1 127.0.0.1:6812/3032 9 : cluster [WRN]
6 slow requests, 6 included below; oldest blocked for > 61.758455 secs

2016-07-25 03:44:06.510583 osd.50 [WRN] slow request 30.005692 seconds
old, received at {date-time}: osd_op(client.4240.0:8
benchmark_data_ceph-1_39426_object7 [write 0~4194304] 0.69848840) v4
currently waiting for subops from [610]

In comparison, my Luminous cluster only shows the general slow/blocked
message:

2018-03-01 21:52:54.237270 7f7e419e3700  0 log_channel(cluster) log
[WRN] : Health check failed: 116 slow requests are blocked > 32 sec
(REQUEST_SLOW)
2018-03-01 21:53:00.282721 7f7e419e3700  0 log_channel(cluster) log
[WRN] : Health check update: 66 slow requests are blocked > 32 sec
(REQUEST_SLOW)
2018-03-01 21:53:08.534244 7f7e419e3700  0 log_channel(cluster) log
[WRN] : Health check update: 5 slow requests are blocked > 32 sec
(REQUEST_SLOW)
2018-03-01 21:53:10.382510 7f7e419e3700  0 log_channel(cluster) log
[INF] : Health check cleared: REQUEST_SLOW (was: 5 slow requests are
blocked > 32 sec)
2018-03-01 21:53:10.382546 7f7e419e3700  0 log_channel(cluster) log
[INF] : Cluster is now healthy

So where are the details?
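In Luminous the per-OSD detail no longer appears in the cluster log, but it can still be pulled from each OSD's admin socket. A minimal sketch, assuming `osd.12` is a hypothetical suspect OSD (run the `ceph daemon` commands on the host where that OSD lives):

```shell
# Cluster-wide summary; grep narrows the output to the slow-request lines.
ceph health detail | grep -i 'blocked\|slow'

# Ask a specific OSD for its currently blocked ops and its recent
# history of slow ops, including what each op is waiting on.
ceph daemon osd.12 dump_blocked_ops
ceph daemon osd.12 dump_historic_ops

# Per-OSD commit/apply latency, useful for spotting one slow disk.
ceph osd perf
```

The `dump_historic_ops` output records the op's event timeline (e.g. waiting for subops), which is roughly the information the old Jewel-style log lines carried.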

Thanks,
Alex
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 

Hi Alex,

I agree, the health detail should display OSD info.

Regarding the slow requests: ESX is very sensitive to I/O delay; this
could be due to cluster overload or a disk going bad. If you have a
monitoring tool, do you see high load spikes (such as disk %busy) on
your disks? It is generally good (if you can) to simulate the maximum
client workload plus concurrent recovery and scrubbing, and watch
whether your resources get overloaded. Most if not all issues with ESX
go away if you have enough hardware. I suspect this is what you are
seeing; in particular, the VMotion and compression could be the extra
load that triggered the issue, in which case the problem is not specific
to a particular OSD. The other possibility is that a particular disk is
going bad; the monitoring tool should be able to identify this.
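One way to run that kind of load test is `rados bench` while watching per-disk utilization on the OSD hosts. A sketch, assuming a throwaway pool named `bench-test` (pool name, PG count, and OSD id are placeholders):

```shell
# Create a throwaway pool and generate sustained write, then read, load.
ceph osd pool create bench-test 64
rados bench -p bench-test 60 write --no-cleanup
rados bench -p bench-test 60 seq

# Meanwhile, on each OSD host, watch per-disk utilization (%util column)
# to see whether any one disk saturates before the others.
iostat -x 5

# Optionally kick off a deep scrub during the run to add recovery-like load.
ceph osd deep-scrub 0

# Clean up the benchmark objects and the pool.
rados -p bench-test cleanup
ceph osd pool delete bench-test bench-test --yes-i-really-really-mean-it
```

If slow requests appear under this synthetic load at roughly the same point they do in production, the bottleneck is capacity rather than a single failing component.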

Note that VMotion generates an I/O pattern that keeps hitting the same
OSD with concurrent small writes; if you can use RBD striping, creating
64 KB stripes may help. A controller with a write-back cache will also
help.
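Striping can be set at image creation time with the `rbd` CLI. A sketch with a 64 KB stripe unit, assuming a hypothetical pool `rbd` and image name `esx-lun01` (size and stripe count are placeholders to tune for your workload):

```shell
# Stripe the image in 64 KB units across 16 objects at a time, so a run
# of small sequential writes fans out over multiple OSDs instead of
# landing on one object (and hence one primary OSD) at a time.
rbd create rbd/esx-lun01 --size 10240 \
    --stripe-unit 65536 --stripe-count 16

# Verify the striping parameters took effect.
rbd info rbd/esx-lun01
```

The stripe unit must evenly divide the object size (65536 divides the default 4 MiB objects), and striping must be chosen at creation; an existing image cannot be re-striped in place.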

Maged
