We had suicide timeouts, but unfortunately I can't remember the specific root cause at this point.
It was definitely one of two things:

* too low a value of net.netfilter.nf_conntrack_max (preventing the OSDs
  from opening new connections to each other)
* too low a value of kernel.pid_max or kernel.threads-max (preventing new
  threads from starting)

I am pretty sure we hit the pid_max limit early on, but the conntrack_max
didn't cause trouble until we had enough VMs running to push the total
number of connections past the default limit. (A quick sketch for checking
and raising these limits is included below, after the quoted thread.) Our
cluster is approximately the same size as yours.

Jordan

On Tue, Sep 19, 2017 at 3:05 PM, Stanley Zhang <[email protected]> wrote:
> We don't use EC pools, but my experience with similar slow requests on
> RGW + replicated pools is that you need to find the first slow request in
> the logs and identify where it came from: for example, a deep-scrub, a
> client accessing corrupted objects, disk errors, etc.
>
> On 20/09/17 8:13 AM, David Turner wrote:
>
> Just starting 3 nights ago, we began seeing OSDs randomly going down in
> our cluster (Jewel 10.2.7). At first I looked at each OSD that had
> recently been marked down in the cluster (`ceph osd dump | grep -E
> '^osd\.[0-9]+\s' | sort -nrk11` gives the OSDs sorted by the OSD map
> epoch in which they were most recently marked down), and all of them had
> been wrongly marked down. Prior to that there is a lot of evidence of
> slow requests, OSD op threads timing out, FileStore op threads timing
> out, and other errors. At the bottom of this email is an excerpt of such
> errors; it is definitely not a comprehensive log of them.
>
> The map epoch of the last time the OSD was marked down in the OSD map
> matches when it logs that it was wrongly marked down. After thousands
> more lines of op threads timing out and slow requests, the OSD finally
> asserts with "hit suicide timeout". The first 2 nights this was localized
> to 2 OSDs on the same host, but last night it also happened on an OSD on
> a second host. 2 of the 3 OSDs have hit this 3 times and the other has
> hit it twice.
>
> There are 15 OSD nodes and 224 total OSDs in the cluster. So far only
> these 3 OSDs have hit this FAILED assert. Does anyone have any idea what
> to do next, or is there any information missing here that would help
> understand things further? It's too intermittent to run the OSD at log
> level 20.
>
> 2017-09-19 02:06:19.712334 7fdfeae86700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fdfc7235700' had timed out after 15
> 2017-09-19 02:06:19.712358 7fdfeae86700 1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7fdfde58f700' had timed out after 60
> 2017-09-19 02:06:55.974099 7fdfe5b11700 0 log_channel(cluster) log [WRN] : 891 slow requests, 5 included below; oldest blocked for > 150.987639 secs
> 2017-09-19 02:06:55.974114 7fdfe5b11700 0 log_channel(cluster) log [WRN] : slow request 150.985487 seconds old, received at 2017-09-19 02:04:24.987412: MOSDECSubOpWrite(97.f8s5 48121 ECSubWrite(tid=4204267, reqid=client.24351228.0:139832730, at_version=48121'1955383, trim_to=48111'1952339, trim_rollback_to=48121'1955373)) currently started
> 2017-09-19 02:06:55.974123 7fdfe5b11700 0 log_channel(cluster) log [WRN] : slow request 150.510718 seconds old, received at 2017-09-19 02:04:25.462180: MOSDECSubOpWrite(97.4es1 48121 ECSubWrite(tid=5407590, reqid=client.24892528.0:36410074, at_version=48121'1960640, trim_to=48111'1957612, trim_rollback_to=48121'1960635)) currently started
> 2017-09-19 02:06:55.974128 7fdfe5b11700 0 log_channel(cluster) log [WRN] : slow request 149.177285 seconds old, received at 2017-09-19 02:04:26.795614: MOSDECSubOpWrite(97.1f4s1 48121 ECSubWrite(tid=8793253, reqid=client.24892528.0:36410582, at_version=48121'1964351, trim_to=48111'1961282, trim_rollback_to=48121'1964344)) currently queued_for_pg
> 2017-09-19 02:06:55.974134 7fdfe5b11700 0 log_channel(cluster) log [WRN] : slow request 147.359320 seconds old, received at 2017-09-19 02:04:28.613578: MOSDECSubOpWrite(97.8s5 48121 ECSubWrite(tid=3228587, reqid=client.18422767.0:329511916, at_version=48121'1965073, trim_to=48111'1962055, trim_rollback_to=48121'1965066)) currently queued_for_pg
> 2017-09-19 02:06:55.974139 7fdfe5b11700 0 log_channel(cluster) log [WRN] : slow request 146.404241 seconds old, received at 2017-09-19 02:04:29.568657: MOSDECSubOpWrite(97.f8s5 48121 ECSubWrite(tid=4204291, reqid=client.18422767.0:329512018, at_version=48121'1955389, trim_to=48111'1952339, trim_rollback_to=48121'1955377)) currently queued_for_pg
> 2017-09-19 02:06:55.974276 7fdf7aa6a700 0 -- 10.10.113.29:6826/3332721 >> 10.10.13.32:6822/130263 pipe(0x7fe003b1c800 sd=67 :24892 s=1 pgs=7268 cs=1 l=0 c=0x7fe01b44be00).connect got RESETSESSION
> 2017-09-19 02:06:55.974626 7fdf78141700 0 -- 10.10.13.29:6826/3332721 >> 10.10.13.33:6802/2651380 pipe(0x7fe008140800 sd=70 :61127 s=1 pgs=7059 cs=1 l=0 c=0x7fe01be72600).connect got RESETSESSION
> 2017-09-19 02:06:55.974657 7fdf78e4e700 0 -- 10.10.13.29:6826/3332721 >> 10.10.13.35:6827/477826 pipe(0x7fe004aef400 sd=91 :19464 s=1 pgs=7316 cs=1 l=0 c=0x7fe01b179080).connect got RESETSESSION
> 2017-09-19 02:06:55.977120 7fdf95310700 0 -- 10.10.13.29:6826/3332721 >> 10.10.13.22:6812/2276428 pipe(0x7fe003d15400 sd=86 :58892 s=1 pgs=15908 cs=1 l=0 c=0x7fe01b2e1500).connect got RESETSESSION
> 2017-09-19 02:06:55.979722 7fdf9500d700 0 -- 10.10.13.29:6826/3332721 >> 10.10.13.27:6830/2018590 pipe(0x7fe003a06800 sd=98 :42191 s=1 pgs=12697 cs=1 l=0 c=0x7fe01b44d780).connect got RESETSESSION
> 2017-09-19 02:06:56.106436 7fdfba1dc700 0 -- 10.10.13.29:6826/3332721 >> 10.10.13.27:6811/2018593 pipe(0x7fe009e79400 sd=137 :54582 s=1 pgs=11500 cs=1 l=0 c=0x7fe005820880).connect got RESETSESSION
> 2017-09-19 02:06:56.107146 7fdfbbaf5700 0 -- 10.10.13.29:6826/3332721 >> 10.10.13.27:6811/2018593 pipe(0x7fe009e79400 sd=137 :54582 s=2 pgs=11602 cs=1 l=0 c=0x7fe005820880).fault, initiating reconnect
> ---
> 2017-09-19 02:06:56.213980 7fdfdd58d700 0 log_channel(cluster) log [WRN] : map e48123 wrongly marked me down
> ---
> 2017-09-19 03:06:34.778837 7fdfeae86700 1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7fdfde58f700' had timed out after 60
> 2017-09-19 03:06:34.778840 7fdfeae86700 1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7fdfddd8e700' had timed out after 60
> 2017-09-19 03:06:39.778908 7fdfeae86700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fdfc422f700' had timed out after 15
> 2017-09-19 03:06:39.778921 7fdfeae86700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fdfc5231700' had timed out after 15
> 2017-09-19 03:06:39.778930 7fdfeae86700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fdfc5231700' had suicide timed out after 150
> 2017-09-19 03:06:39.782749 7fdfeae86700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(const ceph::heartbeat_handle_d*, const char*, time_t)' thread 7fdfeae86700 time 2017-09-19 03:06:39.778940
> common/HeartbeatMap.cc: 86: FAILED assert(0 == "hit suicide timeout")
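For reference, here is a minimal sketch of how those kernel limits can be
checked and raised on an OSD node. The numbers are illustrative
placeholders, not the values we actually run; size them to your own OSD
count, thread count, and connection volume:

  # show the current limits (nf_conntrack_max only exists while the
  # conntrack module is loaded)
  sysctl net.netfilter.nf_conntrack_max kernel.pid_max kernel.threads-max

  # how many conntrack entries are in use right now, to compare against the max
  cat /proc/sys/net/netfilter/nf_conntrack_count

  # apply larger limits immediately (example values only)
  sysctl -w net.netfilter.nf_conntrack_max=1048576
  sysctl -w kernel.pid_max=4194303
  sysctl -w kernel.threads-max=4194303

  # persist across reboots by putting the same three settings in a file
  # such as /etc/sysctl.d/90-ceph-osd.conf and running: sysctl --system

Watching nf_conntrack_count climb toward nf_conntrack_max on a busy OSD
node, or comparing the system-wide thread count (ps -eLf | wc -l) against
pid_max, makes it fairly obvious which of the two limits is being hit.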
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
