On Tuesday, April 8, 2014, Christian Balzer <[email protected]> wrote:

> On Tue, 08 Apr 2014 14:19:20 +0200 Josef Johansson wrote:
> >
> > On 08/04/14 10:39, Christian Balzer wrote:
> > > On Tue, 08 Apr 2014 10:31:44 +0200 Josef Johansson wrote:
> > >
> > >> On 08/04/14 10:04, Christian Balzer wrote:
> > >>> Hello,
> > >>>
> > >>> On Tue, 08 Apr 2014 09:31:18 +0200 Josef Johansson wrote:
> > >>>
> > >>>> Hi all,
> > >>>>
> > >>>> I am currently benchmarking a standard setup with Intel DC S3700
> > >>>> disks as journals and Hitachi 4TB disks as data drives, all on an
> > >>>> LACP 10GbE network.
> > >>>>
> > >>> Unless that is the 400GB version of the DC S3700, you're already
> > >>> limiting yourself to 365MB/s throughput with the 200GB variant --
> > >>> that is, if sequential write speed is that important to you and you
> > >>> think you'll ever get those 5 HDDs to write at full speed with Ceph
> > >>> (unlikely).
> > >> It's the 400GB version of the DC S3700, and yes, I'm aware that I
> > >> need a 1:3 ratio to max out these disks, as they write sequential
> > >> data at about 150MB/s.
> > >> Our thinking is that a 1:5 ratio will cover the current demand, and
> > >> we can upgrade later.
> > > I reckon you'll do fine, as in run out of steam and IOPS before
> > > hitting that limit.
> > >
> > >>>> The size of my journals is 25GB each, and I have two journals per
> > >>>> machine, with 5 OSDs per journal and 5 machines in total. We
> > >>>> currently use the optimal tunables, and the version of Ceph is the
> > >>>> latest Dumpling.
> > >>>>
> > >>>> Benchmarking writes with RBD shows that there's no problem hitting
> > >>>> the upper limits of the 4TB disks with sequential data, thus
> > >>>> maxing out 10GbE. At that point we see full utilization on the
> > >>>> journals.
> > >>>>
> > >>>> But lowering the block size to 4k shows that the journals are
> > >>>> utilized to about 20%, and the 4TB disks 100%.
> > >>>> (rados -p <pool> bench 100 write -b 4096 -t 256)
> > >>>>
> > >>> When you say utilization, I assume you're talking about iostat or
> > >>> atop output?
> > >> Yes, the utilization is from iostat.
> > >>> That's not a bug, that's comparing apples to oranges.
> > >> You mean comparing iostat results with the ones from rados bench?
> > >>> The rados bench default is 4MB, which not only happens to be the
> > >>> default RBD object size but also generates a nice amount of
> > >>> bandwidth.
> > >>>
> > >>> While at 4k writes your SSD is obviously bored, the actual OSD
> > >>> needs to handle all those writes and becomes limited by the IOPS
> > >>> it can perform.
> > >> Yes, it's quite bored and just shuffles data.
> > >> Maybe I've been thinking about this the wrong way,
> > >> but shouldn't the journal buffer more, until the journal partition
> > >> is full or the flush interval is reached?
> > >>
> > > Take a look at "journal queue max ops", which has a default of a
> > > mere 500, so that's full after 2 seconds. ^o^
> > Hm, that makes sense.
> >
> > So I tested out both a low value (5000) and a large value (6553600),
> > but it didn't seem to change anything.
> > Is there any chance I could dump the current values from a running
> > OSD, to see what is actually in use?
> >
> The value can be checked like this (for example):
> ceph --admin-daemon /var/run/ceph/ceph-osd.2.asok config show
>
> If you restarted your OSD after updating ceph.conf, I'm sure you will
> find the values you set there.
>
> However, you are seriously underestimating the packet storm you're
> unleashing with 256 threads of 4KB writes over a 10Gb/s link.
>
> That's theoretically 256K packets/s, very quickly filling even your
> "large" max ops setting.
> Also, "journal max write entries" will need to be adjusted to suit the
> abilities (speed- and merge-wise) of your OSDs.
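The settings discussed above live in the [osd] section of ceph.conf. A sketch of what that might look like -- the values are the ones tried in this thread, not recommendations, and should be tuned to your own hardware:

```ini
[osd]
; journal settings from this thread -- example values, not recommendations
journal queue max ops = 5000
journal max write entries = 2048
```

After restarting the OSD, the live values can be confirmed through the admin socket, e.g. `ceph --admin-daemon /var/run/ceph/ceph-osd.2.asok config show | grep journal`.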
> With 40 million max ops and 2048 max write entries I get this (instead
> of values similar to yours with the defaults):
>
>   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
>     1     256      2963      2707   10.5707   10.5742  0.125177  0.0830565
>     2     256      5278      5022   9.80635   9.04297  0.247757  0.0968146
>     3     256      7276      7020   9.13867   7.80469  0.002813  0.0994022
>     4     256      8774      8518   8.31665   5.85156  0.002976  0.107339
>     5     256     10121      9865   7.70548   5.26172  0.002569  0.117767
>     6     256     11363     11107   7.22969   4.85156  0.38909   0.130649
>     7     256     12354     12098   6.7498    3.87109  0.002857  0.137199
>     8     256     12392     12136   5.92465   0.148438 1.15075   0.138359
>     9     256     12551     12295   5.33538   0.621094 0.003575  0.151978
>    10     256     13099     12843   5.0159    2.14062  0.146283  0.17639
>
> Of course this tails off eventually, but the effect is obvious and the
> bandwidth is double that of the default values.
>
> I'm sure some Inktank person will pipe up momentarily as to why these
> defaults were chosen and why such huge values are to be avoided. ^.-
>
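Christian's back-of-the-envelope numbers can be sanity-checked with simple arithmetic: a bounded op queue fills at the rate writes arrive minus the rate the backing disk drains them. A small Python sketch -- the ~250 ops/s net backlog in the first example is an inference from the "full after 2 seconds" remark above, not a measurement:

```python
def queue_fill_seconds(max_ops, ingest_ops_per_s, drain_ops_per_s=0):
    """Seconds until a bounded op queue fills, given ingest and drain rates."""
    net = ingest_ops_per_s - drain_ops_per_s
    if net <= 0:
        return float("inf")  # the backing disk keeps up; the queue never fills
    return max_ops / net

# Default "journal queue max ops" of 500 at a net backlog of ~250 ops/s:
# full after 2 seconds, matching the remark above.
print(queue_fill_seconds(500, 250))            # 2.0

# 256 threads of 4KB writes is theoretically ~256K ops/s: even the "large"
# setting of 6553600 buys only about 26 seconds before the queue fills.
print(queue_fill_seconds(6_553_600, 256_000))  # 25.6
```

The point is that raising the queue limits delays the stall but cannot prevent it; only the backing disk's drain rate sets the sustainable throughput.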
Just from skimming, those numbers do look a little low, but I'm not sure
how all the latencies work out.

Anyway, the reason we chose the low defaults is to avoid overloading the
backing hard drive, which is going to have a lot more trouble than the
journal with a huge backlog of ops. You'll want to run your small-IO
tests for a very long time (or with a fairly small journal) to check
that you don't get a square wave of throughput while waiting for the
backing disk to commit everything.
-Greg
--
Software Engineer #42 @ http://inktank.com | http://ceph.com
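Greg's "square wave" warning can be illustrated with a toy model: the journal absorbs client writes at full speed while it has space, but once it fills, clients are throttled to the backing disk's drain rate until the backlog clears. A minimal sketch, where every capacity and rate is a made-up illustrative number, not a Ceph measurement:

```python
def simulate_throughput(journal_cap, ingest, drain, steps):
    """Toy journal model: ops accepted from clients in each time step.

    The backing disk drains up to `drain` ops per step; the journal then
    accepts up to `ingest` ops per step, capped by its remaining space.
    """
    fill = 0
    accepted = []
    for _ in range(steps):
        fill = max(0, fill - drain)              # backing disk drains first
        took = min(ingest, journal_cap - fill)   # accept until the journal is full
        fill += took
        accepted.append(took)
    return accepted

# Fast at first, then a collapse to the backing-disk rate once the
# journal fills -- the "square wave" shape.
print(simulate_throughput(1000, 500, 100, 6))  # [500, 500, 200, 100, 100, 100]
```

This is why a short benchmark against an empty journal looks great: the interesting behavior only shows up once the journal has been full at least once.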
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
