Thanks for the info. I was investigating bluestore as well. My hosts don't go unresponsive, but I do see parallel I/O slow down.
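
In case it helps anyone trying to reproduce this, below is a minimal sketch of the kind of parallel-writer test Andras describes further down: one writer per OSD data mount, each timed separately. The mount paths and data size are placeholders I made up, not details from this thread; on a kernel with the xfs interference he mentions, some writers should finish quickly while others barely make progress for minutes.

#!/usr/bin/env python3
# Rough sketch of a parallel-writer test (similar in spirit to the iozone
# runs described below): start one buffered writer per OSD data mount and
# report how long each one takes. MOUNTS and SIZE_MB are assumptions --
# adjust them to your own layout before running.
import os
import time
from concurrent.futures import ProcessPoolExecutor

MOUNTS = ["/var/lib/ceph/osd/ceph-%d" % i for i in range(52)]  # assumed paths
SIZE_MB = 1024                     # data written per mount
BLOCK = b"\0" * (1 << 20)          # 1 MiB per write call

def write_test(mount):
    """Write SIZE_MB of data to a temp file on one mount, fsync, and time it."""
    path = os.path.join(mount, "parallel-io-test.tmp")
    start = time.monotonic()
    with open(path, "wb") as f:
        for _ in range(SIZE_MB):
            f.write(BLOCK)
        f.flush()
        os.fsync(f.fileno())
    os.unlink(path)
    return mount, time.monotonic() - start

if __name__ == "__main__":
    # One process per mount so the writers really run in parallel.
    with ProcessPoolExecutor(max_workers=len(MOUNTS)) as pool:
        for mount, secs in pool.map(write_test, MOUNTS):
            print("%-40s %7.1f s  (%.0f MB/s)" % (mount, secs, SIZE_MB / secs))

If most mounts report similar times and a few are wildly slower (or stall entirely), that points at the filesystem layer rather than ceph itself.
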
On Thu, Aug 23, 2018, 8:02 PM Andras Pataki <[email protected]> wrote:

> We are also running some fairly dense nodes with CentOS 7.4 and ran into
> similar problems. The nodes ran filestore OSDs (Jewel, then Luminous).
> Sometimes a node would be so unresponsive that one couldn't even ssh to
> it (even though the root disk was a physically separate drive on a
> separate controller from the OSD drives). Often these would coincide
> with kernel stack traces about hung tasks. Initially we did blame high
> load, etc. from all the OSDs.
>
> But then we benchmarked the nodes independently of ceph (with iozone and
> such) and noticed problems there too. When we started a few dozen
> iozone processes on separate JBOD drives with xfs, some didn't even
> start and write a single byte for minutes. The conclusion we came to
> was that there is some interference among a lot of mounted xfs file
> systems in the Red Hat 3.10 kernels: some kind of central lock that
> prevents dozens of xfs file systems from running in parallel. When we
> did I/O directly to raw devices in parallel, we saw no problems (no high
> loads, etc.). So we built a newer kernel, and the situation got
> better. 4.4 is already much better, and nowadays we are testing moving
> to 4.14.
>
> Also, migrating to bluestore significantly reduced the load on these
> nodes. At busy times, the filestore host loads were 20-30, even
> higher (on a 28-core node), while the bluestore nodes hummed along at a
> load of perhaps 6 or 8. This also confirms that somehow lots of xfs
> mounts don't work in parallel.
>
> Andras
>
>
> On 08/23/2018 03:24 PM, Tyler Bishop wrote:
> > Yes, I've reviewed all the logs from monitor and host. I am not
> > getting useful errors (or any) in dmesg or general messages.
> >
> > I have 2 ceph clusters; the other cluster is 300 SSDs and I never have
> > issues like this. That's why I'm looking for help.
> >
> > On Thu, Aug 23, 2018 at 3:22 PM Alex Gorbachev <[email protected]> wrote:
> >> On Wed, Aug 22, 2018 at 11:39 PM Tyler Bishop
> >> <[email protected]> wrote:
> >>> During high load testing I'm only seeing user and sys cpu load around
> >>> 60%... my load doesn't seem to be anything crazy on the host, and
> >>> iowait stays between 6 and 10%. I have very good `ceph osd perf`
> >>> numbers too.
> >>>
> >>> I am using 10.2.11 Jewel.
> >>>
> >>>
> >>> On Wed, Aug 22, 2018 at 11:30 PM Christian Balzer <[email protected]> wrote:
> >>>> Hello,
> >>>>
> >>>> On Wed, 22 Aug 2018 23:00:24 -0400 Tyler Bishop wrote:
> >>>>
> >>>>> Hi, I've been fighting to get good stability on my cluster for about
> >>>>> 3 weeks now. I am running into intermittent issues with OSDs
> >>>>> flapping, marking other OSDs down, then going back to a stable state
> >>>>> for hours and days.
> >>>>>
> >>>>> The cluster is 4x Cisco UCS S3260 with dual E5-2660, 256GB RAM, and
> >>>>> 40G networking to 40G Brocade VDX switches. The OSDs are 6TB HGST SAS
> >>>>> drives with 400GB HGST SAS 12G SSDs. My configuration is 4 journals
> >>>>> per host with 12 disks per journal, for a total of 56 disks per
> >>>>> system and 52 OSDs.
> >>>>>
> >>>> Any denser and you'd have a storage black hole.
> >>>>
> >>>> You already pointed your finger in the (or at least one) right
> >>>> direction, and everybody will agree that this setup is woefully
> >>>> underpowered in the CPU department.
> >>>>
> >>>>> I am using CentOS 7 with kernel 3.10 and the Red Hat tuned-adm
> >>>>> profile for throughput-performance enabled.
> >>>>>
> >>>> Ceph version would be interesting as well...
> >>>>
> >>>>> I have these sysctls set:
> >>>>>
> >>>>> kernel.pid_max = 4194303
> >>>>> fs.file-max = 6553600
> >>>>> vm.swappiness = 0
> >>>>> vm.vfs_cache_pressure = 50
> >>>>> vm.min_free_kbytes = 3145728
> >>>>>
> >>>>> I feel like my issue is directly related to the high number of OSDs
> >>>>> per host, but I'm not sure what issue I'm really running into. I
> >>>>> believe that I have ruled out network issues: I am able to get 38Gbit
> >>>>> consistently via iperf testing, and jumbo-frame pings succeed with
> >>>>> the do-not-fragment flag set and an 8972-byte packet size.
> >>>>>
> >>>> The fact that it all works for days at a time suggests this as well,
> >>>> but you need to verify these things when they're happening.
> >>>>
> >>>>> From FIO testing I seem to be able to get 150-200k write IOPS from my
> >>>>> rbd clients on 1gbit networking... This is about what I expected due
> >>>>> to the write penalty and my underpowered CPUs for the number of OSDs.
> >>>>>
> >>>>> I get these messages, which I believe are normal?
> >>>>> 2018-08-22 10:33:12.754722 7f7d009f5700 0 -- 10.20.136.8:6894/718902
> >>>>> >> 10.20.136.10:6876/490574 pipe(0x55aed77fd400 sd=192 :40502 s=2
> >>>>> pgs=1084 cs=53 l=0 c=0x55aed805bc80).fault with nothing to send,
> >>>>> going to standby
> >>>>>
> >>>> Ignore.
> >>>>
> >>>>> Then randomly I'll get a storm of this every few days, for 20 minutes
> >>>>> or so:
> >>>>> 2018-08-22 15:48:32.631186 7f44b7514700 -1 osd.127 37333
> >>>>> heartbeat_check: no reply from 10.20.142.11:6861 osd.198 since back
> >>>>> 2018-08-22 15:48:08.052762 front 2018-08-22 15:48:31.282890 (cutoff
> >>>>> 2018-08-22 15:48:12.630773)
> >>>>>
> >>>> Randomly is unlikely.
> >>>> Again, catch it in the act: atop in huge terminal windows (showing all
> >>>> CPUs and disks) for all nodes should be very telling; collecting and
> >>>> graphing this data might work, too.
> >>>>
> >>>> My suspects would be deep scrubs and/or high IOPS spikes when this is
> >>>> happening, starving out OSD processes (CPU-wise; RAM should be fine,
> >>>> one supposes).
> >>>>
> >>>> Christian
> >>>>
> >>>>> Please help!!!
> >>
> >> Have you looked at the OSD logs on the OSD nodes by chance? I found
> >> that correlating the messages in those logs with your master ceph log,
> >> and also correlating with any messages in syslog or kern.log, can
> >> elucidate the cause of the problem pretty well.
> >> --
> >> Alex Gorbachev
> >> Storcium
> >>
> >>>> --
> >>>> Christian Balzer        Network/Systems Engineer
> >>>> [email protected]          Rakuten Communications
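
Picking up on Alex's suggestion above, something like the sketch below could bucket the heartbeat_check failures per minute and per peer across the OSD logs on one node, so the storms can be lined up against deep scrubs, atop output, or syslog on the named peers. The log path and line format are assumptions based on the sample message quoted above, so treat it as an outline rather than a finished tool.

#!/usr/bin/env python3
# Minimal sketch: count "heartbeat_check: no reply" lines per minute and per
# reported peer across the OSD logs on this node. LOG_GLOB and the regex are
# assumptions (default log location, format of the sample line above).
import glob
import re
from collections import Counter

LOG_GLOB = "/var/log/ceph/ceph-osd.*.log"   # assumed default log location
PATTERN = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}).*heartbeat_check: no reply from (\S+)")

buckets = Counter()
for logfile in glob.glob(LOG_GLOB):
    with open(logfile, errors="replace") as f:
        for line in f:
            m = PATTERN.match(line)
            if m:
                buckets[m.groups()] += 1   # key: (minute, peer address)

# Print minute, peer, and count so spikes and the affected peers stand out.
for (minute, peer), count in sorted(buckets.items()):
    print(minute, peer, count)

If the failing peers cluster on one or two hosts, that is where atop/scrub activity is worth checking first.
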
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
