On 08/04/14 10:39, Christian Balzer wrote:
> On Tue, 08 Apr 2014 10:31:44 +0200 Josef Johansson wrote:
>
>> On 08/04/14 10:04, Christian Balzer wrote:
>>> Hello,
>>>
>>> On Tue, 08 Apr 2014 09:31:18 +0200 Josef Johansson wrote:
>>>
>>>> Hi all,
>>>>
>>>> I am currently benchmarking a standard setup with Intel DC S3700 disks
>>>> as journals and Hitachi 4TB disks as data drives, all on a LACP 10GbE
>>>> network.
>>>>
>>> Unless that is the 400GB version of the DC S3700, you're already limiting
>>> yourself to 365MB/s throughput with the 200GB variant.
>>> That is, if sequential write speed is that important to you and you think
>>> you'll ever get those 5 HDDs to write at full speed with Ceph (unlikely).
>>>
>> It's the 400GB version of the DC S3700, and yes, I'm aware that I need a
>> 1:3 ratio to max out these disks, as they write sequential data at about
>> 150MB/s.
>> But our thinking is that a 1:5 ratio will cover the current demand, and we
>> could upgrade later.
> I'd reckon you'll do fine, as in run out of steam and IOPS before hitting
> that limit.
>
>>>> My journals are 25GB each, and I have two journals per machine, with
>>>> 5 OSDs per journal and 5 machines in total. We currently use the
>>>> optimal tunables and the version of Ceph is the latest dumpling.
>>>>
>>>> Benchmarking writes with rbd shows that there's no problem hitting the
>>>> upper levels on the 4TB disks with sequential data, thus maxing out
>>>> 10GbE. At that point we see full utilization on the journals.
>>>>
>>>> But lowering the byte size to 4k shows that the journals are utilized
>>>> to about 20%, and the 4TB disks 100%. (rados bench -p <pool> -b 4096
>>>> -t 256 100 write)
>>>>
>>> When you say utilization I assume you're talking about iostat or
>>> atop output?
>> Yes, the utilization is from iostat.
>>> That's not a bug, that's comparing apples to oranges.
>> You mean comparing the iostat results with the ones from the rados
>> benchmark?
>>> The rados bench default is 4MB, which not only happens to be the
>>> default RBD object size but also generates a nice amount of
>>> bandwidth.
>>>
>>> With 4k writes your SSD is obviously bored, but the actual OSD needs to
>>> handle all those writes and becomes limited by the IOPS it can perform.
>> Yes, it's quite bored and just shuffles data.
>> Maybe I've been thinking about this the wrong way,
>> but shouldn't the journal buffer more, until the journal partition is full
>> or the flush_interval is reached?
>>
> Take a look at "journal queue max ops", which has a default of a mere 500,
> so that's full after 2 seconds. ^o^
Hm, that makes sense.
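
For reference, this is roughly what I put in ceph.conf (under [osd]) before
re-running the benchmark. The option is the one you mention, the value below
is just one of the ones I tried, and I restarted the OSDs after each change
to be sure it took effect:

    [osd]
        # journal admission limit; the default of 500 ops is what a
        # 4k client load fills within a couple of seconds
        journal queue max ops = 5000

I also pushed the same value with 'ceph tell osd.* injectargs', but since I'm
not sure the journal throttle picks that up at runtime, I restarted the OSDs
anyway.
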
So, I tested out both a low value (5000) and a large value (6553600), but it
didn't seem to change anything.

Any chance I could dump the current values from a running OSD, to actually
see what is in use? (See the PS at the bottom for the sort of thing I mean.)

Cheers,
Josef

> Cheers,
>
> Christian
>
>> Right now the rados benchmark gets about 1MB/s throughput. I really
>> don't know what is expected though, but it seems quite slow.
>>
>> sudo rados bench -p shared-1 -b 4096 300 write
>>  Maintaining 16 concurrent writes of 4096 bytes for up to 300 seconds or 0 objects
>>  Object prefix: benchmark_data_px1_1502
>>    sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat    avg lat
>>      0       0         0         0         0         0         -          0
>>      1      16       203       187  0.730312  0.730469  0.030537   0.080467
>>      2      16       397       381  0.744003  0.757812  0.141118  0.0811331
>>      3      16       625       609  0.792841  0.890625  0.017979  0.0776631
>>      4      16       889       873  0.852415   1.03125   0.10221  0.0725933
>>      5      16      1122      1106  0.863941  0.910156  0.001871  0.0709095
>>      6      16      1437      1421  0.924995   1.23047  0.035859  0.0665901
>>
>> Thanks for helping me out,
>> Josef
>>> Regards,
>>>
>>> Christian
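
PS. For the archives, the sort of dump I had in mind is something along these
lines, via the OSD's admin socket. I'm assuming the default socket path under
/var/run/ceph and osd.0 here; adjust the id and path to the local setup:

    # show every option the running OSD is actually using,
    # filtered down to the journal settings
    ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep journal

That should make it obvious whether the new queue limits actually made it into
the running daemon or not.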