On Fri, Mar 2, 2018 at 4:17 AM Maged Mokhtar <mmokh...@petasan.org> wrote:

> On 2018-03-02 07:54, Alex Gorbachev wrote:
>
> On Thu, Mar 1, 2018 at 10:57 PM, David Turner <drakonst...@gmail.com>
> wrote:
>
> Blocked requests and slow requests are synonyms in Ceph; they are two
> names for the exact same thing.
>
>
> On Thu, Mar 1, 2018, 10:21 PM Alex Gorbachev <a...@iss-integration.com>
> wrote:
>
>
> On Thu, Mar 1, 2018 at 2:47 PM, David Turner <drakonst...@gmail.com>
> wrote:
>
> `ceph health detail` should show you more information about the slow
> requests.  If the output is too much stuff, you can grep out for blocked
> or
> something.  It should tell you which OSDs are involved, how long they've
> been slow, etc.  The default is for them to show '> 32 sec' but that may
> very well be much longer and `ceph health detail` will show that.
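>
> For example, something like this should work (the exact wording of the
> health lines can differ between releases):
>
>   ceph health detail | grep -i -E 'blocked|slow'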
>
>
> Hi David,
>
> Thank you for the reply.  Unfortunately, the health detail only shows
> blocked requests.  This seems to be related to a compression setting
> on the pool; there is nothing in the OSD logs.
>
> I replied to another compression thread.  This makes sense, since
> compression is new, and in the past all such issues were reflected in
> the OSD logs and related to either the network or OSD hardware.
>
> Regards,
> Alex
>
>
> On Thu, Mar 1, 2018 at 2:23 PM Alex Gorbachev <a...@iss-integration.com>
> wrote:
>
>
> Is there a switch to turn on the display of specific OSD issues?  Or
> does the below indicate a generic problem, e.g. the network rather than
> any specific OSD?
>
> 2018-02-28 18:09:36.438300 7f6dead56700  0
> mon.roc-vm-sc3c234@0(leader).data_health(46) update_stats avail 56%
> total 15997 MB, used 6154 MB, avail 9008 MB
> 2018-02-28 18:09:41.477216 7f6dead56700  0 log_channel(cluster) log
> [WRN] : Health check failed: 73 slow requests are blocked > 32 sec
> (REQUEST_SLOW)
> 2018-02-28 18:09:47.552669 7f6dead56700  0 log_channel(cluster) log
> [WRN] : Health check update: 74 slow requests are blocked > 32 sec
> (REQUEST_SLOW)
> 2018-02-28 18:09:53.794882 7f6de8551700  0
> mon.roc-vm-sc3c234@0(leader) e1 handle_command mon_command({"prefix":
> "status", "format": "json"} v 0) v1
>
>
>
> I was wrong about pool compression being the factor: an uncompressed
> pool also generates these slow messages.
>
> The question is why there is no subsequent message pointing at the
> specific OSDs, as in Jewel and prior releases, for example these from
> RH:
> 2015-08-24 13:18:10.024659 osd.1 127.0.0.1:6812/3032 9 : cluster [WRN]
> 6 slow requests, 6 included below; oldest blocked for > 61.758455 secs
>
> 2016-07-25 03:44:06.510583 osd.50 [WRN] slow request 30.005692 seconds
> old, received at {date-time}: osd_op(client.4240.0:8
> benchmark_data_ceph-1_39426_object7 [write 0~4194304] 0.69848840) v4
> currently waiting for subops from [610]
>
> In comparison, my Luminous cluster only shows the general slow/blocked
> message:
>
> 2018-03-01 21:52:54.237270 7f7e419e3700  0 log_channel(cluster) log
> [WRN] : Health check failed: 116 slow requests are blocked > 32 sec
> (REQUEST_SLOW)
> 2018-03-01 21:53:00.282721 7f7e419e3700  0 log_channel(cluster) log
> [WRN] : Health check update: 66 slow requests are blocked > 32 sec
> (REQUEST_SLOW)
> 2018-03-01 21:53:08.534244 7f7e419e3700  0 log_channel(cluster) log
> [WRN] : Health check update: 5 slow requests are blocked > 32 sec
> (REQUEST_SLOW)
> 2018-03-01 21:53:10.382510 7f7e419e3700  0 log_channel(cluster) log
> [INF] : Health check cleared: REQUEST_SLOW (was: 5 slow requests are
> blocked > 32 sec)
> 2018-03-01 21:53:10.382546 7f7e419e3700  0 log_channel(cluster) log
> [INF] : Cluster is now healthy
>
> So where are the details?
>
> Thanks,
> Alex
>
>
>
>
>
> Hi Alex,
>
> I agree, the health detail should display osd info.
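> Until that is improved, you can also ask an OSD directly which ops are
> slow through its admin socket (run on the node hosting the osd; the osd
> id here is only an example):
>
>   # ops currently in flight (including any that are stuck) on this osd
>   ceph daemon osd.12 dump_ops_in_flight
>
>   # recently completed ops with their latencies, useful after the fact
>   ceph daemon osd.12 dump_historic_ops
>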
> Regarding the slow requests: ESX is very sensitive to io delay, so this
> could be due to cluster overload or to a disk going bad. If you have a
> monitoring tool, do you see high load spikes (such as disk % busy) on
> your disks? Generally it is good (if you can) to simulate the max client
> workload plus concurrent recovery and scrubbing, monitor your resources,
> and see if they get overloaded. Most if not all issues with ESX will go
> away if you have enough hardware. I suspect this is what you are seeing;
> in particular the VMotion and compression could be the extra load which
> triggered the issue, in which case the problem is not specific to a
> particular osd. The other possibility is that a particular disk is going
> bad, which the monitoring tool should be able to identify.
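>
> A rough sketch of that kind of test (the pool and osd id below are just
> placeholders, and a forced deep-scrub only approximates recovery load):
>
>   # client-like write load against a throwaway test pool
>   rados -p testpool bench 120 write -t 16
>
>   # concurrent scrub load on one of the osds
>   ceph osd deep-scrub 3
>
>   # watch per-disk utilisation on the osd nodes while this runs
>   iostat -x 5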
>
> Note that VMotion generates an io pattern that keeps hitting the same
> osd with concurrent small writes; if you can use rbd striping, creating
> 64k stripes may help. A controller with a write-back cache will also
> help.
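>
> For example, something like this when creating an image (names and size
> are only illustrative, and non-default striping needs a client that
> supports the rbd striping feature):
>
>   rbd create testpool/esx-lun1 --size 100G --stripe-unit 65536 --stripe-count 16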
>
> Maged
>

Thank you, Maged. This does seem to be related to compression. I did two
things: turned off compression on the pool and tried vMotion to the same
image, which failed.
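
For reference, turning it off amounted to something like this (the pool
name is just a placeholder, not my real one):

  # disable inline compression on the pool, then retest
  ceph osd pool set <poolname> compression_mode none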

Then I made another image and that time it worked. So maybe some compressed
parts of the old image introduced slowness.

To be fair, I also set all Arecas to a 75% buffer, so I will be testing
more today to see whether that is the actual cause.

All OSDs look healthy. Will update... but yes, it is odd that no OSD
detail is displayed with slow requests. A tracker issue may be needed for
that.

Best regards,
Alex

--
Alex Gorbachev
Storcium
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
