On Tue, 08 Apr 2014 14:19:20 +0200 Josef Johansson wrote:
>
> On 08/04/14 10:39, Christian Balzer wrote:
> > On Tue, 08 Apr 2014 10:31:44 +0200 Josef Johansson wrote:
> >
> >> On 08/04/14 10:04, Christian Balzer wrote:
> >>> Hello,
> >>>
> >>> On Tue, 08 Apr 2014 09:31:18 +0200 Josef Johansson wrote:
> >>>
> >>>> Hi all,
> >>>>
> >>>> I am currently benchmarking a standard setup with Intel DC S3700
> >>>> disks as journals and Hitachi 4TB-disks as data-drives, all on LACP
> >>>> 10GbE network.
> >>>>
> >>> Unless that is the 400GB version of the DC S3700, you're already
> >>> limiting yourself to the 365MB/s throughput of the 200GB variant.
> >>> That only matters if sequential write speed is that important to
> >>> you and you think you'll ever get those 5 HDs to write at full
> >>> speed with Ceph (unlikely).
> >> It's the 400GB version of the DC S3700, and yes, I'm aware that I
> >> need a 1:3 ratio to max out these disks, as they write sequential
> >> data at about 150MB/s.
> >> Our thinking is that a 1:5 ratio covers the current demand, and we
> >> could upgrade later.
> > I'd reckon you'll do fine, as in run out of steam and IOPS before
> > hitting that limit.
> >
> >>>> My journals are 25GB each, and I have two journals per
> >>>> machine, with 5 OSDs per journal, with 5 machines in total. We
> >>>> currently use the optimal tunables and the version of ceph is the
> >>>> latest dumpling.
> >>>>
> >>>> Benchmarking writes with rbd shows that there's no problem hitting
> >>>> upper levels on the 4TB-disks with sequential data, thus maxing out
> >>>> 10GbE. At this moment we see full utilization on the journals.
> >>>>
> >>>> But lowering the byte-size to 4k shows that the journals are
> >>>> utilized to about 20%, and the 4TB-disks 100%. (rados -p <pool> -b
> >>>> 4096 -t 256 100 write)
> >>>>
> >>> When you're saying utilization I assume you're talking about iostat
> >>> or atop output?
> >> Yes, the utilization is iostat.
> >>> That's not a bug, that's comparing apples to oranges.
> >> You mean comparing iostat-results with the ones from rados benchmark?
> >>> The rados bench default is 4MB, which not only happens to be the
> >>> default RBD object size but also generates a nice amount of
> >>> bandwidth.
> >>>
> >>> At 4k writes your SSD is obviously bored, but the actual OSD needs
> >>> to handle all those writes and becomes limited by the IOPS it can
> >>> perform.
> >> Yes, it's quite bored and just shuffles data.
> >> Maybe I've been thinking about this the wrong way,
> >> but shouldn't the journal buffer more, until the journal partition
> >> is full or the flush_interval is met?
> >>
> > Take a look at "journal queue max ops", which has a default of a mere
> > 500, so that's full after 2 seconds. ^o^
> Hm, that makes sense.
>
> So, I tested both a low value ( 5000 ) and a large value ( 6553600 ),
> but it didn't seem to change anything.
> Any chance I could dump the current values from a running OSD, to
> actually see what is in use?
>
The values can be checked like this (for example):

    ceph --admin-daemon /var/run/ceph/ceph-osd.2.asok config show

If you restarted your OSD after updating ceph.conf, I'm sure you will find
the values you set there.
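
To pick just the journal-related settings out of that output, something along these lines should work. This is a minimal sketch, assuming `config show` prints a flat JSON object with underscore-separated option names; the helper name `journal_settings` and the sample values are made up for illustration and are not real output.

```python
import json

def journal_settings(config_json):
    """Return only the options whose name starts with 'journal'."""
    config = json.loads(config_json)
    return {k: v for k, v in sorted(config.items()) if k.startswith("journal")}

# Hypothetical sample standing in for the real output of:
#   ceph --admin-daemon /var/run/ceph/ceph-osd.2.asok config show
sample = (
    '{"journal_queue_max_ops": "500", '
    '"journal_max_write_entries": "100", '
    '"osd_op_threads": "2"}'
)

for name, value in journal_settings(sample).items():
    print(name, "=", value)
```

In practice the JSON text would come from the command's stdout rather than from a hard-coded string.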

However, you are seriously underestimating the packet storm you're
unleashing with 256 threads of 4KB packets over a 10Gb/s link.
That's theoretically 256K packets/s, which very quickly fills even your
"large" max ops setting.
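
The arithmetic behind those numbers can be sketched as follows. It assumes roughly 1 GiB/s of usable bandwidth on the 10Gb/s link (an approximation, the raw line rate is a bit higher) and a queue that only fills and never drains, so the fill times are a worst case. The function names are illustrative only.

```python
KIB = 1024
MIB = 1024 ** 2
GIB = 1024 ** 3

def ops_per_second(bytes_per_second, op_size):
    """Number of fixed-size operations that fit into a given byte rate."""
    return bytes_per_second / op_size

def seconds_to_fill(queue_max_ops, ops_rate):
    """Time until a queue of queue_max_ops entries is full if nothing drains."""
    return queue_max_ops / ops_rate

# Assumption: about 1 GiB/s usable on a 10Gb/s link, 4 KiB per write.
link_rate = ops_per_second(1 * GIB, 4 * KIB)
# Rate observed in the original benchmark: about 1 MB/s of 4 KiB writes.
observed_rate = ops_per_second(1 * MIB, 4 * KIB)

print("theoretical ops/s on the link:", link_rate)
print("seconds to fill 6553600 ops at link rate:", seconds_to_fill(6553600, link_rate))
print("seconds to fill 500 ops at observed rate:", seconds_to_fill(500, observed_rate))
```

Under these assumptions the link rate works out to 256K ops/s, and the default of 500 ops fills in about 2 seconds at the observed 1MB/s.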

Also, "journal max write entries" will need to be adjusted to suit the
abilities (speed and merge wise) of your OSDs.

With "journal queue max ops" at 40 million and "journal max write
entries" at 2048 I get this (instead of values similar to yours with the
defaults):

sec  Cur ops  started  finished  avg MB/s  cur MB/s  last lat    avg lat
  1      256     2963      2707   10.5707   10.5742  0.125177  0.0830565
  2      256     5278      5022   9.80635   9.04297  0.247757  0.0968146
  3      256     7276      7020   9.13867   7.80469  0.002813  0.0994022
  4      256     8774      8518   8.31665   5.85156  0.002976   0.107339
  5      256    10121      9865   7.70548   5.26172  0.002569   0.117767
  6      256    11363     11107   7.22969   4.85156   0.38909   0.130649
  7      256    12354     12098    6.7498   3.87109  0.002857   0.137199
  8      256    12392     12136   5.92465  0.148438   1.15075   0.138359
  9      256    12551     12295   5.33538  0.621094  0.003575   0.151978
 10      256    13099     12843    5.0159   2.14062  0.146283    0.17639

Of course this tails off eventually, but the effect is obvious and the
bandwidth is double that of the default values.
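
To relate the MB/s columns in that output to IOPS, here is a small sketch. It assumes rados bench counts a MB as 2^20 bytes and that each finished op is one 4096-byte write; the results agree with the reported averages to within rounding, presumably because the real elapsed time is slightly more than a whole number of seconds. The helper names are made up for illustration.

```python
MIB = 1024 * 1024

def avg_iops(finished_ops, seconds):
    """Average completed operations per second."""
    return finished_ops / seconds

def avg_mb_per_s(finished_ops, seconds, op_size=4096):
    """Average bandwidth in MB/s (assuming 1 MB = 2**20 bytes)."""
    return finished_ops * op_size / MIB / seconds

# (sec, finished) pairs taken from the run with the raised journal limits.
tuned_run = [(1, 2707), (5, 9865), (10, 12843)]

for sec, finished in tuned_run:
    print(sec, round(avg_iops(finished, sec)), round(avg_mb_per_s(finished, sec), 2))
```

At 4KB per write, every MB/s corresponds to 256 IOPS, so the drop in avg MB/s over the run is really a drop in sustained IOPS.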
I'm sure some inktank person will pipe up momentarily as to why these
defaults were chosen and why such huge values are to be avoided. ^.-
Regards,
Christian
> Cheers,
> Josef
> > Cheers,
> >
> > Christian
> >
> >> Right now the rados benchmark gets about 1MB/s throughput. I really
> >> don't know what is expected though, but it seems quite slow.
> >>
> >> sudo rados bench -p shared-1 -b 4096 300 write
> >> Maintaining 16 concurrent writes of 4096 bytes for up to 300 seconds or 0 objects
> >> Object prefix: benchmark_data_px1_1502
> >> sec  Cur ops  started  finished  avg MB/s  cur MB/s  last lat    avg lat
> >>   0        0        0         0         0         0         -          0
> >>   1       16      203       187  0.730312  0.730469  0.030537   0.080467
> >>   2       16      397       381  0.744003  0.757812  0.141118  0.0811331
> >>   3       16      625       609  0.792841  0.890625  0.017979  0.0776631
> >>   4       16      889       873  0.852415   1.03125   0.10221  0.0725933
> >>   5       16     1122      1106  0.863941  0.910156  0.001871  0.0709095
> >>   6       16     1437      1421  0.924995   1.23047  0.035859  0.0665901
> >>
> >> Thanks for helping me out,
> >> Josef
> >>> Regards,
> >>>
> >>> Christian
> >> _______________________________________________
> >> ceph-users mailing list
> >> [email protected]
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>
> >
>
>
--
Christian Balzer Network/Systems Engineer
[email protected] Global OnLine Japan/Fusion Communications
http://www.gol.com/