How did you solve that?

On Fri, Aug 24, 2018, 6:06 AM Andras Pataki <[email protected]>
wrote:

> We pin half the OSDs to each socket (and to the corresponding memory).
> Since the disk controller and the network card are connected only to one
> socket, this still probably produces quite a bit of QPI traffic.
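A minimal sketch of the pinning scheme described above (half the OSDs per socket, memory kept local). The even/odd mapping, OSD ids, and the `numactl` invocation are illustrative, not taken from Andras's actual setup:

```shell
#!/bin/sh
# Sketch: pin half the OSDs to each socket on a 2-socket box.
# The commented numactl line is the real mechanism (requires numactl);
# paths and ids are assumptions for illustration.

node_for_osd() {
    # Even-numbered OSDs -> NUMA node 0, odd-numbered -> node 1,
    # so each socket (and its local memory) gets half the daemons.
    echo $(( $1 % 2 ))
}

for id in 0 1 2 3; do
    node=$(node_for_osd "$id")
    echo "osd.$id -> numa node $node"
    # Live invocation would look something like:
    # numactl --cpunodebind="$node" --membind="$node" \
    #     /usr/bin/ceph-osd -f --cluster ceph --id "$id"
done
```

In practice this mapping would go into the OSD unit/init script rather than a loop, but the node-selection arithmetic is the same.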
> It is also worth investigating how the network does under high load.  We
> did run into problems where 40Gbps cards dropped packets heavily under
> load.
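One quick way to spot the packet drops mentioned above is the per-interface drop counters. On a live host that is `ip -s link show dev <iface>` (or `ethtool -S <iface>` for NIC-specific counters); the sketch below parses a captured sample so the extraction is visible — interface output and numbers are made up:

```shell
#!/bin/sh
# Sketch: extract RX/TX drop counters from `ip -s link`-style output.
# "sample" is fabricated for illustration; live use would be:
#   ip -s link show dev eth0
sample='    RX: bytes  packets  errors  dropped missed  mcast
    981274569  1523441  0       317     0       12
    TX: bytes  packets  errors  dropped carrier collsns
    871236411  1401298  0       0       0       0'

# The counter of interest is the 4th field on the line after RX:/TX:.
rx_dropped=$(printf '%s\n' "$sample" | awk '/RX:/{getline; print $4}')
tx_dropped=$(printf '%s\n' "$sample" | awk '/TX:/{getline; print $4}')
echo "rx_dropped=$rx_dropped tx_dropped=$tx_dropped"
```

A counter that climbs during the problem windows (but not otherwise) would point at the NIC rather than at Ceph.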
>
> Andras
>
>
> On 08/24/2018 05:16 AM, Marc Roos wrote:
> >
> > Can this be related to numa issues? I have also dual processor nodes,
> > and was wondering if there is some guide on how to optimize for numa.
> >
> >
> >
> >
> > -----Original Message-----
> > From: Tyler Bishop [mailto:[email protected]]
> > Sent: vrijdag 24 augustus 2018 3:11
> > To: Andras Pataki
> > Cc: [email protected]
> > Subject: Re: [ceph-users] Stability Issue with 52 OSD hosts
> >
> > Thanks for the info. I was investigating bluestore as well.  My hosts
> > don't go unresponsive, but I do see parallel I/O slow down.
> >
> > On Thu, Aug 23, 2018, 8:02 PM Andras Pataki
> > <[email protected]> wrote:
> >
> >
> >       We are also running some fairly dense nodes with CentOS 7.4 and
> >       ran into similar problems.  The nodes ran filestore OSDs (Jewel,
> >       then Luminous).  Sometimes a node would be so unresponsive that one
> >       couldn't even ssh to it (even though the root disk was a physically
> >       separate drive on a separate controller from the OSD drives).
> >       Often these would coincide with kernel stack traces about hung
> >       tasks.  Initially we did blame high load, etc. from all the OSDs.
> >
> >       But then we benchmarked the nodes independently of Ceph (with
> >       iozone and such) and noticed problems there too.  When we started a
> >       few dozen iozone processes on separate JBOD drives with xfs, some
> >       didn't even start writing a single byte for minutes.  The
> >       conclusion we came to was that there is some interference among a
> >       lot of mounted xfs file systems in the Red Hat 3.10 kernels: some
> >       kind of central lock that prevents dozens of xfs file systems from
> >       running in parallel.  When we did I/O directly to raw devices in
> >       parallel, we saw no problems (no high loads, etc.).  So we built a
> >       newer kernel, and the situation got better.  4.4 is already much
> >       better; nowadays we are testing moving to 4.14.
> >
> >       Also, migrating to bluestore significantly reduced the load on
> >       these nodes too.  At busy times, the filestore host loads were
> >       20-30, even higher (on a 28-core node), while the bluestore nodes
> >       hummed along at a load of perhaps 6 or 8.  This also confirms that
> >       somehow lots of xfs mounts don't work in parallel.
> >
> >       Andras
> >
> >
> >       On 08/23/2018 03:24 PM, Tyler Bishop wrote:
> >       > Yes, I've reviewed all the logs from monitor and host.  I am not
> >       > getting useful errors (or any) in dmesg or general messages.
> >       >
> >       > I have 2 ceph clusters; the other cluster is 300 SSDs and I never
> >       > have issues like this.  That's why I'm looking for help.
> >       >
> >       > On Thu, Aug 23, 2018 at 3:22 PM Alex Gorbachev <[email protected]> wrote:
> >       >> On Wed, Aug 22, 2018 at 11:39 PM Tyler Bishop <[email protected]> wrote:
> >       >>> During high load testing I'm only seeing user and sys CPU load
> >       >>> around 60%... my load doesn't seem to be anything crazy on the
> >       >>> host, and iowait stays between 6 and 10%.  I have very good
> >       >>> `ceph osd perf` numbers too.
> >       >>>
> >       >>> I am using 10.2.11 Jewel.
> >       >>>
> >       >>>
> >       >>> On Wed, Aug 22, 2018 at 11:30 PM Christian Balzer
> > <[email protected]> wrote:
> >       >>>> Hello,
> >       >>>>
> >       >>>> On Wed, 22 Aug 2018 23:00:24 -0400 Tyler Bishop wrote:
> >       >>>>
> >       >>>>> Hi, I've been fighting to get good stability on my cluster
> >       >>>>> for about 3 weeks now.  I am running into intermittent issues
> >       >>>>> with OSDs flapping, marking other OSDs down, then going back
> >       >>>>> to a stable state for hours and days.
> >       >>>>>
> >       >>>>> The cluster is 4x Cisco UCS S3260 with dual E5-2660, 256GB
> >       >>>>> RAM, and 40G networking to 40G Brocade VDX switches.  The
> >       >>>>> OSDs are 6TB HGST SAS drives with 400GB HGST SAS 12G SSDs.
> >       >>>>> My configuration is 4 journals per host with 12 disks per
> >       >>>>> journal, for a total of 56 disks per system and 52 OSDs.
> >       >>>>>
> >       >>>> Any denser and you'd have a storage black hole.
> >       >>>>
> >       >>>> You already pointed your finger in the (or at least one) right
> >       >>>> direction, and everybody will agree that this setup is
> >       >>>> woefully underpowered in the CPU department.
> >       >>>>
> >       >>>>> I am using CentOS 7 with kernel 3.10 and the redhat tuned-adm
> >       >>>>> profile for throughput-performance enabled.
> >       >>>>>
> >       >>>> Ceph version would be interesting as well...
> >       >>>>
> >       >>>>> I have these sysctls set:
> >       >>>>>
> >       >>>>> kernel.pid_max = 4194303
> >       >>>>> fs.file-max = 6553600
> >       >>>>> vm.swappiness = 0
> >       >>>>> vm.vfs_cache_pressure = 50
> >       >>>>> vm.min_free_kbytes = 3145728
> >       >>>>>
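For reference, settings like the list above are normally persisted in a drop-in file so they survive reboots. A minimal sketch (the file path is a common convention, not from the original mail; the sketch writes to a temp file so it can run anywhere, and the privileged `sysctl -p` is shown only as a comment):

```shell
#!/bin/sh
# Sketch: persist the sysctls above in a sysctl.d-style drop-in.
# On a real host the target would be e.g. /etc/sysctl.d/90-ceph-osd.conf,
# applied with `sysctl -p <file>` (requires root).
conf=$(mktemp)
cat > "$conf" <<'EOF'
kernel.pid_max = 4194303
fs.file-max = 6553600
vm.swappiness = 0
vm.vfs_cache_pressure = 50
vm.min_free_kbytes = 3145728
EOF
# sysctl -p "$conf"   # shown for reference only
echo "wrote $(grep -c '=' "$conf") settings to $conf"
```
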
> >       >>>>> I feel like my issue is directly related to the high number
> >       >>>>> of OSDs per host, but I'm not sure what issue I'm really
> >       >>>>> running into.  I believe that I have ruled out network
> >       >>>>> issues: I am able to get 38Gbit consistently via iperf
> >       >>>>> testing, and jumbo pings succeed with don't-fragment set and
> >       >>>>> an 8972-byte packet size.
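The 8972-byte figure above is the largest ICMP payload that fits a 9000-byte MTU: 9000 minus the 20-byte IPv4 header minus the 8-byte ICMP header. A quick sketch of the arithmetic and the corresponding check (the peer address is illustrative):

```shell
#!/bin/sh
# Sketch: derive the max unfragmented ICMP payload for a jumbo MTU.
mtu=9000
ipv4_header=20
icmp_header=8
payload=$(( mtu - ipv4_header - icmp_header ))
echo "max unfragmented ICMP payload for MTU $mtu: $payload"
# Live check with don't-fragment set (Linux ping), for reference:
# ping -M do -s "$payload" -c 3 10.20.136.10
```

If any hop in the path has a smaller MTU, the `-M do` ping fails instead of silently fragmenting, which is exactly what the test is for.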
> >       >>>>>
> >       >>>> The fact that it all works for days at a time suggests this
> >       >>>> as well, but you need to verify these things when they're
> >       >>>> happening.
> >       >>>>
> >       >>>>>  From FIO testing I seem to be able to get 150-200k write
> >       >>>>> IOPS from my rbd clients on 1gbit networking... This is about
> >       >>>>> what I expected due to the write penalty and my underpowered
> >       >>>>> CPU for the number of OSDs.
> >       >>>>>
> >       >>>>> I get these messages, which I believe are normal?
> >       >>>>> 2018-08-22 10:33:12.754722 7f7d009f5700  0 -- 10.20.136.8:6894/718902 >> 10.20.136.10:6876/490574 pipe(0x55aed77fd400 sd=192 :40502 s=2 pgs=1084 cs=53 l=0 c=0x55aed805bc80).fault with nothing to send, going to standby
> >       >>>>>
> >       >>>> Ignore.
> >       >>>>
> >       >>>>> Then randomly I'll get a storm of this every few days, for 20 minutes or so:
> >       >>>>> 2018-08-22 15:48:32.631186 7f44b7514700 -1 osd.127 37333 heartbeat_check: no reply from 10.20.142.11:6861 osd.198 since back 2018-08-22 15:48:08.052762 front 2018-08-22 15:48:31.282890 (cutoff 2018-08-22 15:48:12.630773)
> >       >>>>>
> >       >>>> Randomly is unlikely.
> >       >>>> Again, catch it in the act: atop in huge terminal windows
> >       >>>> (showing all CPUs and disks) for all nodes should be very
> >       >>>> telling; collecting and graphing this data might work, too.
> >       >>>>
> >       >>>> My suspects would be deep scrubs and/or high IOPS spikes when
> >       >>>> this is happening, starving out OSD processes (CPU-wise; RAM
> >       >>>> should be fine, one supposes).
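One way to test the deep-scrub suspicion is to count scrubbing PGs when a storm hits. On a live cluster that would be `ceph pg dump pgs_brief 2>/dev/null | grep -c scrubbing`; the sketch below runs the same counting over a captured sample so it is self-contained — pg ids and states are made up:

```shell
#!/bin/sh
# Sketch: count PGs in a scrubbing state from pg-dump-style output.
# "sample" is fabricated; live use would pipe `ceph pg dump pgs_brief`.
sample='1.a   active+clean
1.b   active+clean+scrubbing+deep
1.c   active+clean
1.d   active+clean+scrubbing'
scrubbing=$(printf '%s\n' "$sample" | grep -c 'scrubbing')
echo "$scrubbing pgs scrubbing"
# To test the hypothesis directly, scrubs can be paused temporarily:
# ceph osd set noscrub && ceph osd set nodeep-scrub
# (and unset again afterwards with `ceph osd unset ...`)
```

If the flap storms stop while scrubbing is disabled, that points at scrub I/O starving the OSD heartbeat threads.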
> >       >>>>
> >       >>>> Christian
> >       >>>>
> >       >>>>> Please help!!!
> >       >> Have you looked at the OSD logs on the OSD nodes, by chance?  I
> >       >> found that correlating the messages in those logs with your
> >       >> master ceph log, and also with any messages in syslog or
> >       >> kern.log, can elucidate the cause of the problem pretty well.
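The correlation Alex describes amounts to grepping every log for the same time window around a known event. A minimal sketch (file names and the sample log lines are fabricated for illustration; on a real node the inputs would be /var/log/ceph/ceph-osd.*.log, syslog, and kern.log):

```shell
#!/bin/sh
# Sketch: pull all log lines near a known event timestamp out of
# several logs at once. Sample files are created inline so the
# sketch is self-contained.
dir=$(mktemp -d)
printf '2018-08-22 15:48:31 osd.127 heartbeat_check: no reply from osd.198\n' \
    > "$dir/ceph-osd.127.log"
printf 'Aug 22 15:48:30 host kernel: INFO: task xfsaild blocked for more than 120 seconds\n' \
    > "$dir/kern.log"

# Grep all logs for the ten seconds around the event (15:48:30-39):
hits=$(cat "$dir"/* | grep -c '15:48:3')
echo "$hits correlated lines around 15:48:3x"
```

A hung-task kernel message landing in the same window as a heartbeat_check failure, as in this fabricated sample, would be a strong hint that the host (not the network) is the culprit.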
> >       >> --
> >       >> Alex Gorbachev
> >       >> Storcium
> >       >>
> >       >>
> >       >>>>> _______________________________________________
> >       >>>>> ceph-users mailing list
> >       >>>>> [email protected]
> >       >>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >       >>>>>
> >       >>>>
> >       >>>> --
> >       >>>> Christian Balzer        Network/Systems Engineer
> >       >>>> [email protected]           Rakuten Communications
> >
> >
> >
> >
>
>
