Thanks for the info. I was investigating bluestore as well.  My hosts don't
go unresponsive, but I do see parallel I/O slow down.

On Thu, Aug 23, 2018, 8:02 PM Andras Pataki <[email protected]>
wrote:

> We are also running some fairly dense nodes with CentOS 7.4 and ran into
> similar problems.  The nodes ran filestore OSDs (Jewel, then Luminous).
> Sometimes a node would be so unresponsive that one couldn't even ssh to
> it (even though the root disk was a physically separate drive on a
> separate controller from the OSD drives).  Often these would coincide
> with kernel stack traces about hung tasks. Initially we did blame high
> load, etc. from all the OSDs.
>
> But then we benchmarked the nodes independently of Ceph (with iozone and
> such) and noticed problems there too.  When we started a few dozen
> iozone processes on separate JBOD drives with xfs, some didn't even
> start writing a single byte for minutes.  The conclusion we came to
> was that there is some interference among a lot of mounted xfs file
> systems in the Red Hat 3.10 kernels.  Some kind of central lock
> prevents dozens of xfs file systems from running in parallel.  When we
> did I/O directly to the raw devices in parallel, we saw no problems (no
> high loads, etc.).  So we built a newer kernel, and the situation got
> better.  4.4 is already much better; nowadays we are testing a move to
> 4.14.
>
> Also, migrating to bluestore significantly reduced the load on these
> nodes too.  At busy times, the filestore host loads were 20-30, even
> higher (on a 28-core node), while the bluestore nodes hummed along at a
> load of perhaps 6 or 8.  This also confirms that somehow lots of xfs
> mounts don't work well in parallel.
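For what it's worth, the xfs-interference symptom Andras describes can be checked without iozone. A minimal sketch (the mount list and write sizes here are placeholders; point it at your real JBOD xfs mounts, e.g. the OSD data directories):

```shell
#!/bin/sh
# Measure time-to-first-write for parallel writers across many mount points.
# If dozens of xfs mounts serialize on a shared lock, some writers will show
# a large "first-write" delay even though the disks are independent.
probe_mounts() {
    for m in "$@"; do
        mkdir -p "$m"
        (
            start=$(date +%s)
            # first tiny synced write: how long until this writer gets going?
            dd if=/dev/zero of="$m/probe" bs=4k count=1 conv=fsync 2>/dev/null
            first=$(date +%s)
            # then a bulk write to keep the drive busy
            dd if=/dev/zero of="$m/bulk" bs=1M count=64 conv=fsync 2>/dev/null
            end=$(date +%s)
            echo "$m first-write=$((first - start))s bulk=$((end - first))s"
            rm -f "$m/probe" "$m/bulk"
        ) &
    done
    wait
}

# Example: probe two scratch directories (substitute your OSD mounts)
probe_mounts /tmp/probe_a /tmp/probe_b
```

On a healthy kernel every writer should report a first-write delay near zero; writers stuck for tens of seconds before their first byte would match what we saw.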
>
> Andras
>
>
> On 08/23/2018 03:24 PM, Tyler Bishop wrote:
> > Yes, I've reviewed all the logs from the monitors and hosts.  I am not
> > getting useful errors (or any at all) in dmesg or the general messages
> > log.
> >
> > I have 2 Ceph clusters; the other cluster is 300 SSDs and I never have
> > issues like this.  That's why I'm looking for help.
> >
> > On Thu, Aug 23, 2018 at 3:22 PM Alex Gorbachev <[email protected]>
> wrote:
> >> On Wed, Aug 22, 2018 at 11:39 PM Tyler Bishop
> >> <[email protected]> wrote:
> >>> During high load testing I'm only seeing user and sys CPU load around
> >>> 60%... the load doesn't seem to be anything crazy on the host, and
> >>> iowait stays between 6 and 10%.  I also get very good `ceph osd perf`
> >>> numbers.
> >>>
> >>> I am using 10.2.11 Jewel.
> >>>
> >>>
> >>> On Wed, Aug 22, 2018 at 11:30 PM Christian Balzer <[email protected]>
> wrote:
> >>>> Hello,
> >>>>
> >>>> On Wed, 22 Aug 2018 23:00:24 -0400 Tyler Bishop wrote:
> >>>>
> >>>>> Hi,   I've been fighting to get good stability on my cluster for
> >>>>> about 3 weeks now.  I am running into intermittent issues with OSDs
> >>>>> flapping, marking other OSDs down, then going back to a stable state
> >>>>> for hours and days.
> >>>>>
> >>>>> The cluster is 4x Cisco UCS S3260 with dual E5-2660, 256GB RAM, and
> >>>>> 40G networking to 40G Brocade VDX switches.  The OSDs are 6TB HGST
> >>>>> SAS drives with 400GB HGST SAS 12G SSDs.  My configuration is 4
> >>>>> journals per host with 12 disks per journal, for a total of 56 disks
> >>>>> per system and 52 OSDs.
> >>>>>
> >>>> Any denser and you'd have a storage black hole.
> >>>>
> >>>> You already pointed your finger in the (or at least one) right
> >>>> direction, and everybody will agree that this setup is woefully
> >>>> underpowered in the CPU department.
> >>>>
> >>>>> I am using CentOS 7 with kernel 3.10 and the redhat tuned-adm profile
> >>>>> for throughput-performance enabled.
> >>>>>
> >>>> Ceph version would be interesting as well...
> >>>>
> >>>>> I have these sysctls set:
> >>>>>
> >>>>> kernel.pid_max = 4194303
> >>>>> fs.file-max = 6553600
> >>>>> vm.swappiness = 0
> >>>>> vm.vfs_cache_pressure = 50
> >>>>> vm.min_free_kbytes = 3145728
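In case it helps when comparing tunables, a quick way to confirm those values actually took effect is to read the kernel's view from /proc/sys directly (a sysctl.d entry can be silently overridden by a later file). This is just a sketch, keyed to the exact tunables listed above:

```shell
#!/bin/sh
# Print the effective value of each tunable straight from /proc/sys,
# independent of whatever the sysctl.conf / sysctl.d files claim.
for k in kernel.pid_max fs.file-max vm.swappiness \
         vm.vfs_cache_pressure vm.min_free_kbytes; do
    f="/proc/sys/$(echo "$k" | tr . /)"
    if [ -r "$f" ]; then
        echo "$k = $(cat "$f")"
    else
        echo "$k = (not available)"
    fi
done
```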
> >>>>>
> >>>>> I feel like my issue is directly related to the high number of OSDs
> >>>>> per host, but I'm not sure what issue I'm really running into.  I
> >>>>> believe that I have ruled out network issues: I am able to get 38Gbit
> >>>>> consistently via iperf testing, and jumbo-frame pings succeed with
> >>>>> the don't-fragment flag set and an 8972-byte packet size.
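For anyone repeating that network verification, roughly this sequence covers it (interface name and peer address are placeholders; iperf3 shown here, plain iperf works similarly). The 8972 figure is the 9000-byte MTU minus 20 bytes of IP header and 8 bytes of ICMP header:

```shell
#!/bin/sh
# End-to-end jumbo-frame sanity check before blaming Ceph.
# ICMP payload = MTU - 20 (IP header) - 8 (ICMP header).
jumbo_payload() { echo $(( $1 - 28 )); }

IFACE="${IFACE:-eth0}"
PEER="${PEER:-}"   # set to the other node's IP to actually run the probes

echo "payload for 9000 MTU: $(jumbo_payload 9000) bytes"

if [ -n "$PEER" ]; then
    ip link show "$IFACE" | grep -o 'mtu [0-9]*'
    # -M do sets the don't-fragment bit: fails fast if any hop's MTU < 9000
    ping -M do -s "$(jumbo_payload 9000)" -c 3 "$PEER"
    # throughput check: run "iperf3 -s" on the peer first
    iperf3 -c "$PEER" -t 10 -P 4
fi
```

The important part is running these probes while the flapping is happening, not only when the cluster is healthy.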
> >>>>>
> >>>> The fact that it all works for days at a time suggests this as well,
> >>>> but you need to verify these things when they're happening.
> >>>>
> >>>>> From FIO testing I seem to be able to get 150-200k write IOPS from
> >>>>> my rbd clients on 1Gbit networking... This is about what I expected
> >>>>> due to the write penalty and my underpowered CPUs for the number of
> >>>>> OSDs.
> >>>>>
> >>>>> I get these messages which I believe are normal?
> >>>>> 2018-08-22 10:33:12.754722 7f7d009f5700  0 -- 10.20.136.8:6894/718902
> >>>>> >> 10.20.136.10:6876/490574 pipe(0x55aed77fd400 sd=192 :40502 s=2
> >>>>> pgs=1084 cs=53 l=0 c=0x55aed805bc80).fault with nothing to send,
> >>>>> going to standby
> >>>>>
> >>>> Ignore.
> >>>>
> >>>>> Then randomly I'll get a storm of these every few days for 20
> >>>>> minutes or so:
> >>>>> 2018-08-22 15:48:32.631186 7f44b7514700 -1 osd.127 37333
> >>>>> heartbeat_check: no reply from 10.20.142.11:6861 osd.198 since back
> >>>>> 2018-08-22 15:48:08.052762 front 2018-08-22 15:48:31.282890 (cutoff
> >>>>> 2018-08-22 15:48:12.630773)
> >>>>>
> >>>> Randomly is unlikely.
> >>>> Again, catch it in the act: atop in huge terminal windows (showing all
> >>>> CPUs and disks) on all nodes should be very telling; collecting and
> >>>> graphing this data might work, too.
> >>>>
> >>>> My suspects would be deep scrubs and/or high IOPS spikes when this is
> >>>> happening, starving out OSD processes (CPU-wise; RAM should be fine,
> >>>> one supposes).
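To catch it in the act, even a crude per-node capture is enough to line timestamps up with the heartbeat_check storms afterwards. A sketch (log path and interval are arbitrary choices, and iostat availability is an assumption, hence the fallback):

```shell
#!/bin/sh
# Emit one timestamped sample of load and per-disk activity. Run it in a
# loop on each node and grep the log around the storm's timestamps later.
capture_sample() {
    date '+%Y-%m-%d %H:%M:%S'
    echo "loadavg: $(cat /proc/loadavg)"
    # per-disk extended stats; iostat may be absent, hence the raw fallback
    iostat -x 1 1 2>/dev/null || cat /proc/diskstats
}

# Loop example (run under nohup/tmux/cron on each node):
#   while :; do capture_sample; sleep 5; done >> /var/tmp/osd-capture.log
capture_sample
```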
> >>>>
> >>>> Christian
> >>>>
> >>>>> Please help!!!
> >> Have you looked at the OSD logs on the OSD nodes by chance?  I found
> >> that correlating the messages in those logs with your master ceph log
> >> and also correlating with any messages in syslog or kern.log can
> >> elucidate the cause of the problem pretty well.
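A small helper along those lines, in case it's useful. It assumes every log line leads with a "YYYY-MM-DD HH:MM:SS" timestamp, which ceph-osd logs do; classic syslog timestamps would need reformatting first, and the file paths in the example are placeholders:

```shell
#!/bin/sh
# Merge several log files into one time-ordered stream, tagging each line
# with its source file, so osd heartbeat misses and kernel hung-task traces
# can be read side by side.
correlate() {
    # prefix each line with its filename, then sort on the date+time fields
    awk '{ print FILENAME ": " $0 }' "$@" | sort -t' ' -k2,3
}

# Example (hypothetical paths):
# correlate /var/log/ceph/ceph-osd.127.log /var/log/kern.log | less
```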
> >> --
> >> Alex Gorbachev
> >> Storcium
> >>
> >>
> >>>>> _______________________________________________
> >>>>> ceph-users mailing list
> >>>>> [email protected]
> >>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>>>>
> >>>>
> >>>> --
> >>>> Christian Balzer        Network/Systems Engineer
> >>>> [email protected]           Rakuten Communications
>
>
