Hey,

Sadly, I'm still battling this issue, but I did notice one interesting thing.

I changed the cache settings for my cache tier to add redundancy to the
pool, which triggered a lot of recovery activity on the cache. During all
of this there were absolutely no slow requests reported. Can I conclude
anything from that? Is it possible that the lack of SSD journals on my
erasure pool is the bottleneck, and that this is what generates the
slow requests?
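
One thing I plan to check next (just a sketch, not something I've
confirmed helps): per-OSD journal latency. On Jewel, ceph osd perf
reports commit and apply latency per OSD, so if the journal-less
erasure-pool OSDs are the bottleneck they should stand out there.

    # Per-OSD fs_commit_latency / fs_apply_latency in ms; the
    # journal-less OSDs should show high commit latency if the
    # missing SSD journals are the problem.
    ceph osd perf

    # Watch for spikes while forcing cache flushes (sorts on the
    # commit-latency column, highest last):
    watch -n 1 "ceph osd perf | sort -nk2 | tail -20"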

I can't imagine why a request would stay blocked for 30 or even 60
seconds. It's getting really frustrating not being able to fix this, and I
don't know what else to try at this point.
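
For anyone willing to dig in with me: the admin socket can dump op
timelines, which should show where those 30-60 seconds actually go
(osd.6 here is just an example id; run it on the node hosting that OSD):

    # Event timeline (queued_for_pg, reached_pg, commit_sent, ...) of
    # the slowest recent ops on this OSD:
    ceph daemon osd.6 dump_historic_ops

    # Ops currently in flight, with how long each has been blocked:
    ceph daemon osd.6 dump_ops_in_flight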

If anybody has suggestions for anything I haven't tried yet, please let me
know.

Peter

On Thu, May 5, 2016 at 10:30 AM, Peter Kerdisle <[email protected]>
wrote:

> Hey guys,
>
> I'm running into an issue with my cluster during high activity.
>
> I have two SSD cache servers (2 SSDs for journals, 7 SSDs for data), each
> with a 2x10Gbit bond, and six OSD nodes with a 10Gbit public and a 10Gbit
> cluster network for the erasure pool (10x3TB disks without separate
> journals).
> This is all on Jewel.
>
> It's working fine under normal load. However, when I force increased
> activity by lowering the cache_target_dirty_ratio to make sure my dirty
> objects are flushed, things start to go amiss.
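>
> (For completeness, this is how I lower it; "hot-cache" stands in for my
> cache pool's actual name, and 0.4/0.6 are just example values:)
>
>     # Force flushing by lowering the dirty ratio...
>     ceph osd pool set hot-cache cache_target_dirty_ratio 0.4
>     # ...and restore it afterwards:
>     ceph osd pool set hot-cache cache_target_dirty_ratio 0.6
>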
>
> To give an example: http://pastie.org/private/5k5ml6a8gqkivjshgjcedg
>
> This is especially concerning: pgs: 9 activating+undersized+degraded, 48
> active+undersized+degraded, 1 stale+active+clean, 27 peering.
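>
> (To see which PGs are stuck and where they map, I've been looking at:)
>
>     # PGs stuck inactive (e.g. peering/activating) and the OSDs
>     # they map to:
>     ceph pg dump_stuck inactive
>     # PGs stuck unclean (e.g. degraded/undersized):
>     ceph pg dump_stuck unclean
>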
>
> Here is another minute or so where I grepped for warnings:
> http://pastie.org/private/bfv3kxl63cfcduafoaurog
>
> These warnings are generated across all of the OSD nodes, not from one
> specific OSD or even one node.
>
> During this time the different OSD logs show various warnings:
>
> 2016-05-05 10:04:10.873603 7f1afaf3b700  1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7f1af3f2d700' had timed out after 15
> 2016-05-05 10:04:10.873605 7f1afaf3b700  1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7f1af5f31700' had timed out after 15
> 2016-05-05 10:04:10.905997 7f1afc73e700  1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7f1af3f2d700' had timed out after 15
> 2016-05-05 10:04:10.906000 7f1afc73e700  1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7f1af5f31700' had timed out after 15
> 2016-05-05 10:04:10.906022 7f1afaf3b700  1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7f1af3f2d700' had timed out after 15
> 2016-05-05 10:04:10.906027 7f1afaf3b700  1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7f1af5f31700' had timed out after 15
> 2016-05-05 10:04:10.949894 7f1af3f2d700  1 heartbeat_map reset_timeout
> 'OSD::osd_op_tp thread 0x7f1af3f2d700' had timed out after 15
> 2016-05-05 10:04:10.956801 7f1afc73e700  1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7f1af5f31700' had timed out after 15
> 2016-05-05 10:04:10.956833 7f1afaf3b700  1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7f1af5f31700' had timed out after 15
> 2016-05-05 10:04:10.957816 7f1af5f31700  1 heartbeat_map reset_timeout
> 'OSD::osd_op_tp thread 0x7f1af5f31700' had timed out after 15
>
> or
>
> 2016-05-05 10:03:27.269658 7f98cca35700 -1 osd.6 7235 heartbeat_check: no
> reply from osd.25 since back 2016-05-05 10:03:06.566276 front 2016-05-05
> 10:03:06.566276 (cutoff 2016-05-05 10:03:07.269651)
> 2016-05-05 10:03:28.269838 7f98cca35700 -1 osd.6 7235 heartbeat_check: no
> reply from osd.25 since back 2016-05-05 10:03:06.566276 front 2016-05-05
> 10:03:06.566276 (cutoff 2016-05-05 10:03:08.269831)
> 2016-05-05 10:03:29.269998 7f98cca35700 -1 osd.6 7235 heartbeat_check: no
> reply from osd.25 since back 2016-05-05 10:03:06.566276 front 2016-05-05
> 10:03:06.566276 (cutoff 2016-05-05 10:03:09.269992)
> 2016-05-05 10:03:29.801145 7f98b339a700 -1 osd.6 7235 heartbeat_check: no
> reply from osd.25 since back 2016-05-05 10:03:06.566276 front 2016-05-05
> 10:03:06.566276 (cutoff 2016-05-05 10:03:09.801142)
> 2016-05-05 10:04:06.275237 7f98cca35700  0 log_channel(cluster) log [WRN]
> : 1 slow requests, 1 included below; oldest blocked for > 30.910385 secs
> 2016-05-05 10:04:06.275252 7f98cca35700  0 log_channel(cluster) log [WRN]
> : slow request 30.910385 seconds old, received at 2016-05-05
> 10:03:35.364796: osd_op(osd.70.6640:6555588 4.1041c254
> rbd_data.afd7564682858.0000000000014daa [copy-from ver 105993] snapc 0=[]
> ondisk+write+ignore_overlay+enforce_snapc+known_if_redirected e7235)
> currently commit_sent
>
>
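> If I read this right, "had timed out after 15" is the default
> osd_op_thread_timeout of 15 seconds, i.e. an op thread stuck on a single
> op for that long, and the health warnings should name the OSDs the
> blocked requests sit on:
>
>     # Lists each slow/blocked request and the OSD it is stuck on,
>     # to see whether they cluster on the erasure-pool OSDs:
>     ceph health detail | grep -E 'slow|blocked'
>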
> I've followed the instructions on
> http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/
> to hopefully find out what's happening, but as far as the hardware is
> concerned everything looks fine: no SMART errors logged, iostat shows some
> activity but nothing pegged at 100%, no messages in dmesg, and CPU usage
> peaks at around 25%.
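>
> (For reference, roughly the checks I ran; /dev/sdb is just an example
> device:)
>
>     # Extended per-device stats; %util near 100 or a high await
>     # would point at a saturated disk:
>     iostat -x 1 5
>
>     # SMART health and error counters per data disk:
>     smartctl -a /dev/sdb | grep -i -E 'health|error'
>
>     # Kernel messages:
>     dmesg | tail -50
>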
>
> This makes me think it might be network related. I have a 10Gbit public
> and a 10Gbit cluster network, neither of which seems to be hitting any
> limits. One thing that might be a problem: one of the cache nodes has a
> bonded interface and no access to the cluster network, while the other
> cache node has separate public and cluster interfaces.
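>
> Next I want to measure raw throughput between the nodes to rule the
> network in or out, along these lines (iperf3 assumed available; the
> hostname and bond0 are placeholders):
>
>     # On an OSD node:
>     iperf3 -s
>     # From a cache node, against the OSD node's cluster-network
>     # address, 4 parallel streams for 30 seconds:
>     iperf3 -c osd-node1-cluster -P 4 -t 30
>
>     # Error/drop counters on the bonded interface:
>     ip -s link show bond0
>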
>
> Could anybody suggest further steps I can take to pin down where this
> bottleneck lies?
>
> Thanks in advance,
>
> Peter
>