Hello,

On Thu, 8 Jan 2015 00:17:11 +0000 Sanders, Bill wrote:

> Thanks for your reply, Christian.  Sorry for my delay in responding.
> 
> The kernel logs are silent.  Forgot to mention before that ntpd is
> running and the nodes are sync'd.
> 
> I'm working on some folks for an updated kernel, but I'm not holding my
> breath.  That said, If I'm seeing this problem by running rados bench on
> the storage cluster itself, is it fair to say that the kernel code isn't
> the issue?
> 
Well, aside from such nuggets as:
http://tracker.ceph.com/issues/6301 
(which you're obviously not facing, but still)
most people tend to run Ceph with the latest stable-ish kernels for a
variety of reasons. 
If nothing else, you'll hopefully pick up some other improvements and be
able to compare notes with a broader group of Ceph users.

> vm/min_free_kbytes is now set to 512M, though that didn't solve the
> issue.  
I wasn't expecting it to, but if you look at threads as recent as this
one:
http://comments.gmane.org/gmane.comp.file-systems.ceph.user/15167

Setting this with IB HCAs makes a lot of sense.
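
For reference this is just a sysctl; something like the following (the
value matches the 512MB you already set, adjust for your RAM):

sysctl -w vm.min_free_kbytes=524288
echo "vm.min_free_kbytes = 524288" >> /etc/sysctl.conf

The second line is just so it survives a reboot.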

> I also set "filestore_max_sync_interval = 30" (and commented out
> the journal line) as you suggested, but that didn't seem to change
> anything, either.  

That setting could/should improve journal utilization; it has nothing per
se to do with your problem. Of course you will need to restart all OSDs
(and make sure the change took effect by looking at the active
configuration via the admin socket). 
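
For example (osd.16 is just an example, any OSD on that node will do):

ceph --admin-daemon /var/run/ceph/ceph-osd.16.asok config show | grep filestore_max_sync_interval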

> Not sure what you mean about the monitors and
> SSD's... they currently *are* hosted on SSD's, which don't appear to be 
> 
Cut off in the middle of the sentence?
Anyways, from your description "2x1TB spinners configured in RAID for the
OS" I have to assume that /var/lib/ceph/ is part of that RAID and that's
where the monitors keep their very active leveldb. 
It really likes to be on SSDs; I could make monitors go wonky on a
similar setup just by running bonnie++ on those OS disks.
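
A quick way to confirm where the mon stores actually live (assuming the
default path):

df -h /var/lib/ceph/mon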

> When rados bench starts, atop (holy crap that's a lot of info) shows
> that the HDD's go crazy for a little while (busy >85%).  The SSD's never
> get that busy (certainly <50%).  I attached a few 'snapshots' of atop
> taken just after the test starts (~12s), while it was still running
> (~30s), and after the test was supposed to have ended (~70s), but was
> essentially waiting for slow-requests.  The only thing red-lining at all
> were the HDD's
> 
Yeah, atop is quite informative in a big window and if you think that's
TMI, look at the performance counters on each OSD as I mentioned earlier.
"ceph --admin-daemon /var/run/ceph/ceph-osd.16.asok perf dump"

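To make that readable, pipe it through a JSON pretty-printer, e.g.:

ceph --admin-daemon /var/run/ceph/ceph-osd.16.asok perf dump | python -m json.tool | less

and look at the op/subop and journal/apply latency counters (the exact
counter names vary a bit between versions).
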
The HDDs are expected to hit 100% busy during the benchmark, and nothing
stands out in particular. Was one of the disks on this node involved in a
slow request?

I find irqbalance clumsy and often plain wrong; while your top IRQ load
is nothing to worry about, you might want to investigate separating your
network and disk controller IRQs onto separate (real) cores, but within
the same CPU/NUMA node.
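
Roughly like this (IRQ number and device names are examples only, check
/proc/interrupts for yours), with irqbalance stopped first so it doesn't
fight you:

grep -e mlx4 -e megasas /proc/interrupts
echo 4 > /proc/irq/54/smp_affinity    # pin IRQ 54 to core 2 (hex bitmask)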

> I wonder how I could test our network.  Are you thinking its possible
> we're losing packets?  I'll ping (har!) our network guy... 
> 
Network people tend to run away screaming when IB is mentioned; that's
why I'm the IB guy here and not one of the four (in our team alone)
network guys. 

What exactly are you using (hardware, IB stack, IPoIB mode) and are those
single ports or are they bonded?
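
For example (interface name assumed to be ib0, adjust as needed):

cat /sys/class/net/ib0/mode    # "connected" or "datagram"
ibstat                         # HCA model, firmware, link rate

would cover most of it, along with which OFED/IB stack version you're
running.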

> I have to admit that the OSD logs don't mean a whole lot to me.  Are OSD
> log entries like this normal?  This is not from during the test, but
> just before when the system was essentially idle.
> 
> 2015-01-07 15:38:40.340883 7fa264ff7700  0 -- 39.71.48.8:6800/46686 >>
> 39.71.48.6:6806/47930 pipe(0x7fa268c14480 sd=111 :40639 s=2 pgs=559
> cs=13 l=0 c=0x7fa283060080).fault with nothing to send, going to standby
> 2015-01-07 15:38:53.573890 7fa2b99f6700  0 -- 39.71.48.8:6800/46686 >>
> 39.71.48.9:6805/23130 pipe(0x7fa268c55800 sd=127 :6800 s=2 pgs=152 cs=13
> l=0 c=0x7fa268c17e00).fault with nothing to send, going to standby
> 2015-01-07 15:38:55.881934 7fa281bfd700  0 -- 39.71.48.8:6800/46686 >>
> 39.71.48.9:6809/44433 pipe(0x7fa268c12180 sd=65 :41550 s=2 pgs=599 cs=19
> l=0 c=0x7fa28305fc00).fault with nothing to send, going to standby
> 2015-01-07 15:38:56.360866 7fa29e1f6700  0 -- 39.71.48.8:6800/46686 >>
> 39.71.48.6:6820/48681 pipe(0x7fa268c14980 sd=145 :6800 s=2 pgs=500 cs=21
> l=0 c=0x7fa28305fa80).fault with nothing to send, going to standby
> 2015-01-07 15:38:58.767181 7fa2a85f6700  0 -- 39.71.48.8:6800/46686 >>
> 39.71.48.6:6820/48681 pipe(0x7fa268c55d00 sd=52 :6800 s=0 pgs=0 cs=0 l=0
> c=0x7fa268c18b80).accept connect_seq 22 vs existing 21 state standby
> 2015-01-07 15:38:58.943514 7fa253cf0700  0 -- 39.71.48.8:6800/46686 >>
> 39.71.48.9:6805/23130 pipe(0x7fa268c55f80 sd=49 :6800 s=0 pgs=0 cs=0 l=0
> c=0x7fa268c18d00).accept connect_seq 14 vs existing 13 state standby
> 
Totally normal.

> 
> For the OSD complaining about slow requests its logs show something like
> during the test:
> 
> 2015-01-07 15:47:28.463470 7fc0714f0700  0 -- 39.7.48.7:6812/16907 >>
> 39.7.48.4:0/3544514455 pipe(0x7fc08f827a80 sd=153 :6812 s=0 pgs=0 cs=0
> l=0 c=0x7fc08f882580).accept peer addr is really 39.7.48.4:0/3544514455
> (socket is 39.7.48.4:464 35/0) 2015-01-07 15:48:04.426399 7fc0e9bfd700
> 0 log [WRN] : 1 slow requests, 1 included below; oldest blocked for >
> 30.738429 secs 2015-01-07 15:48:04.426416 7fc0e9bfd700  0 log [WRN] :
> slow request 30.738429 seconds old, received at 2015-01-07
> 15:47:33.687935: osd_op(client.92886.0:4711
> benchmark_data_tvsaq1_29431_object4710 [write 0~4194304] 3.1639422f
> ack+ondisk+ write e1464) v4 currently waiting for subops from 22,36
> 2015-01-07 15:48:34.429979 7fc0e9bfd700  0 log [WRN] : 1 slow requests,
> 1 included below; oldest blocked for > 60.742016 secs 2015-01-07
> 15:48:34.429997 7fc0e9bfd700  0 log [WRN] : slow request 60.742016
> seconds old, received at 2015-01-07 15:47:33.687935:
> osd_op(client.92886.0:4711 benchmark_data_tvsaq1_29431_object4710 [write
> 0~4194304] 3.1639422f ack+ondisk+ write e1464) v4 currently waiting for
> subops from 22,36
> 
Which is "normal" and unfortunately not particularly informative.

Look at things with:
ceph --admin-daemon /var/run/ceph/ceph-osd.[slowone].asok dump_historic_ops 
when it happens. 
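
The output is JSON; look at the "duration" of each op and the per-event
timestamps to see where it actually spent its time (e.g. waiting for
subops). Something like this makes it easier to read (field names can
differ slightly between versions):

ceph --admin-daemon /var/run/ceph/ceph-osd.[slowone].asok dump_historic_ops | python -m json.tool | less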

Christian

> ________________________________________
> From: Christian Balzer [ch...@gol.com]
> Sent: Tuesday, January 06, 2015 12:25 AM
> To: ceph-users@lists.ceph.com
> Cc: Sanders, Bill
> Subject: Re: [ceph-users] Slow/Hung IOs
> 
> On Mon, 5 Jan 2015 22:36:29 +0000 Sanders, Bill wrote:
> 
> > Hi Ceph Users,
> >
> > We've got a Ceph cluster we've built, and we're experiencing issues
> > with slow or hung IO's, even running 'rados bench' on the OSD cluster.
> > Things start out great, ~600 MB/s, then rapidly drops off as the test
> > waits for IO's. Nothing seems to be taxed... the system just seems to
> > be waiting.  Any help trying to figure out what could cause the slow
> > IO's is appreciated.
> >
> I assume nothing in the logs of the respective OSDs either?
> Kernel or other logs equally silent?
> 
> Watching things with atop (while running the test) not showing anything
> particular?
> 
> Looking at the myriad of throttles and other data in
> http://ceph.com/docs/next/dev/perf_counters/
> might be helpful for the affected OSDs.
> 
> Having this kind of (consistent?) trouble feels like a networking issue
> of sorts, OSDs not able to reach each other or something massively
> messed up in the I/O stack.
> 
> [snip]
> 
> > Our ceph cluster is 4x Dell R720xd nodes:
> > 2x1TB spinners configured in RAID for the OS
> > 10x4TB spinners for OSD's (XFS)
> > 2x400GB SSD's, each with 5x~50GB OSD journals
> > 2x Xeon E5-2620 CPU (/proc/cpuinfo reports 24 cores)
> > 128GB RAM
> > Two networks (public+cluster), both over infiniband
> >
> Usual IB kernel tuning done, network stack stuff and vm/min_free_kbytes
> to 512MB at least?
> 
> > Three monitors are configured on the first three nodes, and use a chunk
> > of one of the SSDs for their data, on an XFS partition
> >
> Since you see nothing in the logs probably not your issue, but monitors
> like the I/O for their leveldb fast, SSD recommended.
> 
> > Software:
> > SLES 11SP3, with some in house patching. (3.0.1 kernel, "ceph-client"
> > backported from 3.10) Ceph version: ceph-0.80.5-0.9.2, packaged by SUSE
> >
> Can't get a 3.16 backport for this?
> 
> > ceph.conf:
> > fsid = 3e8dbfd8-c3c8-4d30-80e2-cd059619d757
> > mon initial members = tvsaq1, tvsaq2, tvsar1
> > mon host = 39.7.48.6, 39.7.48.7, 39.7.48.8
> >
> > cluster network = 39.64.0.0/12
> > public network = 39.0.0.0/12
> > auth cluster required = cephx
> > auth service required = cephx
> > auth client required = cephx
> > osd journal size = 9000
> Not sure how this will affect things given that you have 50GB partitions.
> 
> I'd remove that line and replace it with something like:
> 
>  filestore_max_sync_interval = 30
> 
> (I use 10 with 10GB journals)
> 
> Regards,
> 
> Christian
> 
> > filestore xattr use omap = true
> > osd crush update on start = false
> > osd pool default size = 3
> > osd pool default min size = 1
> > osd pool default pg num = 4096
> > osd pool default pgp num = 4096
> >
> > mon clock drift allowed = .100
> > osd mount options xfs = rw,noatime,inode64
> >
> >
> >
> >
> 
> 
> --
> Christian Balzer        Network/Systems Engineer
> ch...@gol.com           Global OnLine Japan/Fusion Communications
> http://www.gol.com/


-- 
Christian Balzer        Network/Systems Engineer                
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
