On Tuesday, April 8, 2014, Christian Balzer <[email protected]> wrote:

> On Tue, 08 Apr 2014 14:19:20 +0200 Josef Johansson wrote:
> >
> > On 08/04/14 10:39, Christian Balzer wrote:
> > > On Tue, 08 Apr 2014 10:31:44 +0200 Josef Johansson wrote:
> > >
> > >> On 08/04/14 10:04, Christian Balzer wrote:
> > >>> Hello,
> > >>>
> > >>> On Tue, 08 Apr 2014 09:31:18 +0200 Josef Johansson wrote:
> > >>>
> > >>>> Hi all,
> > >>>>
> > >>>> I am currently benchmarking a standard setup with Intel DC S3700
> > >>>> disks as journals and Hitachi 4TB disks as data drives, all on an
> > >>>> LACP 10GbE network.
> > >>>>
> > >>> Unless that is the 400GB version of the DC S3700, you're already
> > >>> limiting yourself to 365MB/s throughput with the 200GB variant --
> > >>> that is, if sequential write speed is that important to you and you
> > >>> think you'll ever get those 5 HDDs to write at full speed with Ceph
> > >>> (unlikely).
> > >> It's the 400GB version of the DC S3700, and yes, I'm aware that I
> > >> need a 1:3 ratio to max out these disks, as they write sequential
> > >> data at about 150MB/s.
> > >> Our thinking is that a 1:5 ratio will cover the current demand, and
> > >> we can upgrade later.
> > > I reckon you'll do fine, as in run out of steam and IOPS before
> > > hitting that limit.
> > >
> > >>>> The size of my journals is 25GB each, and I have two journals per
> > >>>> machine, with 5 OSDs per journal and 5 machines in total. We
> > >>>> currently use the optimal tunables, and the version of Ceph is the
> > >>>> latest Dumpling.
> > >>>>
> > >>>> Benchmarking writes with RBD shows that there's no problem hitting
> > >>>> the upper limits of the 4TB disks with sequential data, thus
> > >>>> maxing out 10GbE. At that point we see full utilization on the
> > >>>> journals.
> > >>>>
> > >>>> But lowering the block size to 4k shows that the journals are
> > >>>> utilized to about 20%, and the 4TB disks 100%.
> > >>>> (rados -p <pool> bench 100 write -b 4096 -t 256)
> > >>>>
> > >>> When you say utilization, I assume you're talking about iostat or
> > >>> atop output?
> > >> Yes, the utilization is from iostat.
> > >>> That's not a bug, that's comparing apples to oranges.
> > >> You mean comparing iostat results with the ones from rados bench?
> > >>> The rados bench default is 4MB, which not only happens to be the
> > >>> default RBD object size but also generates a nice amount of
> > >>> bandwidth.
> > >>>
> > >>> While at 4k writes your SSD is obviously bored, the actual OSD
> > >>> needs to handle all those writes and becomes limited by the IOPS
> > >>> it can perform.
> > >> Yes, it's quite bored and just shuffles data.
> > >> Maybe I've been thinking about this the wrong way,
> > >> but shouldn't the journal buffer more, until the journal partition
> > >> is full or the flush interval is reached?
> > >>
> > > Take a look at "journal queue max ops", which has a default of a
> > > mere 500, so that's full after 2 seconds. ^o^
> > Hm, that makes sense.
> >
> > So I tested out both a low value (5000) and a large value (6553600),
> > but it didn't seem to change anything.
> > Is there any chance I could dump the current values from a running
> > OSD, to see what is actually in use?
> >
> The value can be checked like this (for example):
> ceph --admin-daemon /var/run/ceph/ceph-osd.2.asok config show
>
> If you restarted your OSD after updating ceph.conf, I'm sure you will
> find the values you set there.
>
> However, you are seriously underestimating the packet storm you're
> unleashing with 256 threads of 4KB writes over a 10Gb/s link.
>
> That's theoretically 256K packets/s, very quickly filling even your
> "large" max ops setting.
> Also, "journal max write entries" will need to be adjusted to suit the
> abilities (speed- and merge-wise) of your OSDs.
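The settings discussed above live in the [osd] section of ceph.conf. A sketch of what that might look like -- the values are the ones tried in this thread, not recommendations, and should be tuned to your own hardware:

```ini
[osd]
; journal settings from this thread -- example values, not recommendations
journal queue max ops = 5000
journal max write entries = 2048
```

After restarting the OSD, the live values can be confirmed through the admin socket, e.g. `ceph --admin-daemon /var/run/ceph/ceph-osd.2.asok config show | grep journal`.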
> With 40 million max ops and 2048 max write entries I get this (instead
> of values similar to yours with the defaults):
>
>   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
>     1     256      2963      2707   10.5707   10.5742  0.125177  0.0830565
>     2     256      5278      5022   9.80635   9.04297  0.247757  0.0968146
>     3     256      7276      7020   9.13867   7.80469  0.002813  0.0994022
>     4     256      8774      8518   8.31665   5.85156  0.002976  0.107339
>     5     256     10121      9865   7.70548   5.26172  0.002569  0.117767
>     6     256     11363     11107   7.22969   4.85156  0.38909   0.130649
>     7     256     12354     12098   6.7498    3.87109  0.002857  0.137199
>     8     256     12392     12136   5.92465   0.148438 1.15075   0.138359
>     9     256     12551     12295   5.33538   0.621094 0.003575  0.151978
>    10     256     13099     12843   5.0159    2.14062  0.146283  0.17639
>
> Of course this tails off eventually, but the effect is obvious and the
> bandwidth is double that of the default values.
>
> I'm sure some Inktank person will pipe up momentarily as to why these
> defaults were chosen and why such huge values are to be avoided. ^.-
>
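Christian's back-of-the-envelope numbers can be sanity-checked with simple arithmetic: a bounded op queue fills at the rate writes arrive minus the rate the backing disk drains them. A small Python sketch -- the ~250 ops/s net backlog in the first example is an inference from the "full after 2 seconds" remark above, not a measurement:

```python
def queue_fill_seconds(max_ops, ingest_ops_per_s, drain_ops_per_s=0):
    """Seconds until a bounded op queue fills, given ingest and drain rates."""
    net = ingest_ops_per_s - drain_ops_per_s
    if net <= 0:
        return float("inf")  # the backing disk keeps up; the queue never fills
    return max_ops / net

# Default "journal queue max ops" of 500 at a net backlog of ~250 ops/s:
# full after 2 seconds, matching the remark above.
print(queue_fill_seconds(500, 250))            # 2.0

# 256 threads of 4KB writes is theoretically ~256K ops/s: even the "large"
# setting of 6553600 buys only about 26 seconds before the queue fills.
print(queue_fill_seconds(6_553_600, 256_000))  # 25.6
```

The point is that raising the queue limits delays the stall but cannot prevent it; only the backing disk's drain rate sets the sustainable throughput.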
Just from skimming, those numbers do look a little low, but I'm not sure
how all the latencies work out.

Anyway, the reason we chose the low defaults is to avoid overloading the
backing hard drive, which is going to have a lot more trouble than the
journal with a huge backlog of ops. You'll want to run your small-IO
tests for a very long time (or with a fairly small journal) to check
that you don't get a square wave of throughput while waiting for the
backing disk to commit everything.
-Greg
--
Software Engineer #42 @ http://inktank.com | http://ceph.com
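Greg's "square wave" warning can be illustrated with a toy model: the journal absorbs client writes at full speed while it has space, but once it fills, clients are throttled to the backing disk's drain rate until the backlog clears. A minimal sketch, where every capacity and rate is a made-up illustrative number, not a Ceph measurement:

```python
def simulate_throughput(journal_cap, ingest, drain, steps):
    """Toy journal model: ops accepted from clients in each time step.

    The backing disk drains up to `drain` ops per step; the journal then
    accepts up to `ingest` ops per step, capped by its remaining space.
    """
    fill = 0
    accepted = []
    for _ in range(steps):
        fill = max(0, fill - drain)              # backing disk drains first
        took = min(ingest, journal_cap - fill)   # accept until the journal is full
        fill += took
        accepted.append(took)
    return accepted

# Fast at first, then a collapse to the backing-disk rate once the
# journal fills -- the "square wave" shape.
print(simulate_throughput(1000, 500, 100, 6))  # [500, 500, 200, 100, 100, 100]
```

This is why a short benchmark against an empty journal looks great: the interesting behavior only shows up once the journal has been full at least once.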
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
