Hello Mark,

Thanks for your explanation, it all makes sense. I've done
some measurements on Google and Amazon clouds as well, and
those numbers really do seem pretty good. I'll be playing with
fine-tuning a little bit more, but overall performance
really seems quite nice.

Thanks to all of you for your replies, guys!

nik


On Mon, Dec 14, 2015 at 11:03:16AM -0600, Mark Nelson wrote:
> 
> 
> On 12/14/2015 04:49 AM, Nikola Ciprich wrote:
> >Hello,
> >
> >I'm doing some measurements on a test (3-node) cluster and see a strange
> >performance drop for sync writes.
> >
> >I'm using an SSD for both the journal and the OSD data. It should be
> >suitable for the journal, giving about 16.1K IOPS (67 MB/s) for sync IO.
> >
> >(measured using fio --filename=/dev/xxx --direct=1 --sync=1 --rw=write 
> >--bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting 
> >--name=journal-test)
> >
> >On top of this cluster, I have a KVM guest running (using the qemu librbd
> >backend). Overall performance seems to be quite good, but the problem is
> >when I try to measure sync IO performance inside the guest: I'm getting
> >only about 600 IOPS, which I think is quite poor.
> >
> >The problem is, I don't see any bottleneck: the OSD daemons don't seem to
> >be waiting on IO or hogging CPU, and the qemu process is also not heavily
> >loaded.
> >
> >I'm using hammer 0.94.5 on top of CentOS 6 (4.1 kernel), with all
> >debugging disabled.
> >
> >My question is: what results can I expect for synchronous writes? I
> >understand there will always be some performance drop, but 600 IOPS on
> >top of storage which can give as much as 16K IOPS seems too little.
> 
> So basically what this comes down to is latency.  Since you get 16K IOPS for
> O_DSYNC writes on the SSD, there's a good chance that it has a
> super-capacitor on board and can basically acknowledge a write as complete
> as soon as it hits the on-board cache rather than when it's written to
> flash.  Figure that 16K O_DSYNC IOPS means each IO is completing in around
> 0.06ms on average.  That's very fast!  At 600 IOPS for O_DSYNC writes on
> your guest, you're looking at about 1.6ms per IO on average.
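> 
> As a rough sketch of that IOPS-to-latency conversion (a toy calculation,
> not from any tool; with numjobs=1 and iodepth=1 the average latency is
> simply the inverse of the IOPS figure):
> 
>   # avg_latency_ms.py -- per-IO latency implied by a sync IOPS number
>   def avg_latency_ms(iops):
>       # at queue depth 1, each IO must finish before the next starts,
>       # so average latency is one second divided by the IOPS rate
>       return 1000.0 / iops
> 
>   print(avg_latency_ms(16000))  # ~0.06 ms on the raw SSD
>   print(avg_latency_ms(600))    # ~1.67 ms inside the guest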
> 
> So how do we account for the difference?  Let's start out by looking at a
> quick example of network latency (This is between two random machines in one
> of our labs at Red Hat):
> 
> >64 bytes from gqas008: icmp_seq=1 ttl=64 time=0.583 ms
> >64 bytes from gqas008: icmp_seq=2 ttl=64 time=0.219 ms
> >64 bytes from gqas008: icmp_seq=3 ttl=64 time=0.224 ms
> >64 bytes from gqas008: icmp_seq=4 ttl=64 time=0.200 ms
> >64 bytes from gqas008: icmp_seq=5 ttl=64 time=0.196 ms
> 
> Now consider that when you do a write in Ceph, you write to the primary
> OSD, which then writes out to the replica OSDs.  Every replica IO has to
> complete before the primary will send the acknowledgment to the client
> (i.e. you have to add the latency of the worst of the replica writes!).
> In your case, the network latency alone is likely dramatically increasing
> IO latency vs raw SSD O_DSYNC writes.  Now add in the time to process
> CRUSH mappings, look up directory and inode metadata on the filesystem
> where objects are stored (assuming it's not cached), and other processing
> time, and the 1.6ms latency for the guest writes starts to make sense.
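> 
> To put rough numbers on that chain (everything below is an illustrative
> assumption, not a measurement from your cluster):
> 
>   # sync_write_model.py -- toy model of one replicated O_DSYNC write
>   def sync_write_latency_ms(client_rtt_ms, replica_rtts_ms,
>                             ssd_commit_ms, osd_work_ms):
>       # the primary cannot ack until its own commit and the slowest
>       # replica commit have both finished
>       replica_cost = max(replica_rtts_ms) + ssd_commit_ms + osd_work_ms
>       primary_cost = ssd_commit_ms + osd_work_ms
>       return client_rtt_ms + max(primary_cost, replica_cost)
> 
>   # e.g. ~0.2ms network RTTs, ~0.06ms SSD commit, ~0.5ms of OSD work
>   print(sync_write_latency_ms(0.2, [0.2, 0.2], 0.06, 0.5))  # ~0.96 ms
> 
> That is already approaching a millisecond before qemu/librbd overhead and
> any uncached metadata lookups are counted.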
> 
> Can we improve things?  Likely yes.  There are various areas in the code
> where we can trim latency away, implement alternate OSD backends, and
> potentially use alternate network technology like RDMA to reduce network
> latency.  The thing to remember is that when you are talking about O_DSYNC
> writes, even very small increases in latency can have dramatic effects on
> performance.  Every fraction of a millisecond has huge ramifications.
> 
> >
> >Has anyone done similar measurements?
> >
> >thanks a lot in advance!
> >
> >BR
> >
> >nik
> >
> 

-- 
-------------------------------------
Ing. Nikola CIPRICH
LinuxBox.cz, s.r.o.
28.rijna 168, 709 00 Ostrava

tel.:   +420 591 166 214
fax:    +420 596 621 273
mobil:  +420 777 093 799
www.linuxbox.cz

mobil servis: +420 737 238 656
email servis: ser...@linuxbox.cz
-------------------------------------


_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
