Re: [ceph-users] sync writes - expected performance?

2015-12-16 Thread Nikola Ciprich
Hello Mark,

thanks for your explanation, it all makes sense. I've done
some measuring on Google and Amazon clouds as well and, really,
those numbers seem to be pretty good. I'll be playing with
fine tuning a little bit more, but overall performance
really seems to be quite nice.

Thanks to all of you for your replies, guys!

nik


On Mon, Dec 14, 2015 at 11:03:16AM -0600, Mark Nelson wrote:
> 
> 
> On 12/14/2015 04:49 AM, Nikola Ciprich wrote:
> >Hello,
> >
> >I'm doing some measuring on a test (3-node) cluster and see a strange
> >performance drop for sync writes..
> >
> >I'm using SSD for both journalling and OSD. It should be suitable for
> >journal, giving about 16.1KIOPS (67MB/s) for sync IO.
> >
> >(measured using fio --filename=/dev/xxx --direct=1 --sync=1 --rw=write 
> >--bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting 
> >--name=journal-test)
> >
> >On top of this cluster, I have a KVM guest running (using the qemu librbd backend).
> >Overall performance seems to be quite good, but the problem is when I try
> >to measure sync IO performance inside the guest.. I'm getting only about
> >600 IOPS, which I think is quite poor.
> >
> >The problem is, I don't see any bottleneck: the OSD daemons don't seem to be
> >hanging on IO or hogging CPU, and the qemu process isn't particularly loaded either..
> >
> >I'm using hammer 0.94.5 on top of centos 6 (4.1 kernel), all debugging 
> >disabled,
> >
> >my question is, what results can I expect for synchronous writes? I understand
> >there will always be some performance drop, but 600 IOPS on top of storage
> >which can give as much as 16K IOPS seems too little..
> 
> So basically what this comes down to is latency.  Since you get 16K IOPS for
> O_DSYNC writes on the SSD, there's a good chance that it has a
> super-capacitor on board and can basically acknowledge a write as complete
> as soon as it hits the on-board cache rather than when it's written to
> flash.  Figure that 16K O_DSYNC IOPS means each IO is completing in
> around 0.06ms on average.  That's very fast!  At 600 IOPS for O_DSYNC writes
> on your guest, you're looking at about 1.6ms per IO on average.
> 
> So how do we account for the difference?  Let's start out by looking at a
> quick example of network latency (This is between two random machines in one
> of our labs at Red Hat):
> 
> >64 bytes from gqas008: icmp_seq=1 ttl=64 time=0.583 ms
> >64 bytes from gqas008: icmp_seq=2 ttl=64 time=0.219 ms
> >64 bytes from gqas008: icmp_seq=3 ttl=64 time=0.224 ms
> >64 bytes from gqas008: icmp_seq=4 ttl=64 time=0.200 ms
> >64 bytes from gqas008: icmp_seq=5 ttl=64 time=0.196 ms
> 
> now consider that when you do a write in ceph, you write to the primary OSD
> which then writes out to the replica OSDs.  Every replica IO has to complete
> before the primary will send the acknowledgment to the client (ie you have
> to add the latency of the worst of the replica writes!). In your case, the
> network latency alone is likely dramatically increasing IO latency vs raw
> SSD O_DSYNC writes.  Now add in the time to process crush mappings, look up
> directory and inode metadata on the filesystem where objects are stored
> (assuming it's not cached), and other processing time, and the 1.6ms latency
> for the guest writes starts to make sense.
> 
> Can we improve things?  Likely yes.  There are various areas in the code where
> we can trim latency away, implement alternate OSD backends, and potentially
> use alternate network technology like RDMA to reduce network latency.  The
> thing to remember is that when you are talking about O_DSYNC writes, even
> very small increases in latency can have dramatic effects on performance.
> Every fraction of a millisecond has huge ramifications.
> 
> >
> >Has anyone done similar measuring?
> >
> >thanks a lot in advance!
> >
> >BR
> >
> >nik
> >
> >
> >
> >

-- 
-
Ing. Nikola CIPRICH
LinuxBox.cz, s.r.o.
28.rijna 168, 709 00 Ostrava

tel.:   +420 591 166 214
fax:    +420 596 621 273
mobil:  +420 777 093 799
www.linuxbox.cz

mobil servis: +420 737 238 656
email servis: ser...@linuxbox.cz
-




[ceph-users] sync writes - expected performance?

2015-12-14 Thread Nikola Ciprich
Hello,

I'm doing some measuring on a test (3-node) cluster and see a strange performance
drop for sync writes..

I'm using SSD for both journalling and OSD. It should be suitable for
journal, giving about 16.1KIOPS (67MB/s) for sync IO.

(measured using fio --filename=/dev/xxx --direct=1 --sync=1 --rw=write --bs=4k 
--numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting 
--name=journal-test)
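
(For reference, a rough Python equivalent of that test -- a minimal sketch only:
it times 4k O_DSYNC writes at queue depth 1. /dev/xxx is a placeholder for a
device or file you can safely overwrite, and unlike the fio run above it does
not set O_DIRECT.)

import os, time

DEV = "/dev/xxx"        # placeholder, same as in the fio command above
BLOCK = b"\0" * 4096    # 4k writes, single thread, queue depth 1
RUNTIME = 10            # seconds

fd = os.open(DEV, os.O_WRONLY | os.O_DSYNC)  # each write returns only once durable
ios = 0
start = time.monotonic()
while time.monotonic() - start < RUNTIME:
    os.pwrite(fd, BLOCK, (ios * 4096) % (1 << 30))  # sequential, wraps at 1 GiB
    ios += 1
os.close(fd)

elapsed = time.monotonic() - start
print("%.0f IOPS, %.3f ms avg latency" % (ios / elapsed, 1000.0 * elapsed / ios))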

On top of this cluster, I have a KVM guest running (using the qemu librbd backend).
Overall performance seems to be quite good, but the problem is when I try
to measure sync IO performance inside the guest.. I'm getting only about
600 IOPS, which I think is quite poor.

The problem is, I don't see any bottleneck: the OSD daemons don't seem to be
hanging on IO or hogging CPU, and the qemu process isn't particularly loaded either..

I'm using hammer 0.94.5 on top of centos 6 (4.1 kernel), all debugging disabled,

my question is, what results can I expect for synchronous writes? I understand
there will always be some performance drop, but 600 IOPS on top of storage which
can give as much as 16K IOPS seems too little..

Has anyone done similar measuring?

thanks a lot in advance!

BR

nik


-- 
-
Ing. Nikola CIPRICH
LinuxBox.cz, s.r.o.
28.rijna 168, 709 00 Ostrava

tel.:   +420 591 166 214
fax:    +420 596 621 273
mobil:  +420 777 093 799
www.linuxbox.cz

mobil servis: +420 737 238 656
email servis: ser...@linuxbox.cz
-




Re: [ceph-users] sync writes - expected performance?

2015-12-14 Thread Warren Wang - ISD
Which SSD are you using? The dsync flag will dramatically slow down most SSDs.
You've got to be very careful about the SSD you pick.

Warren Wang




On 12/14/15, 5:49 AM, "Nikola Ciprich"  wrote:

>Hello,
>
>I'm doing some measuring on a test (3-node) cluster and see a strange
>performance drop for sync writes..
>
>I'm using SSD for both journalling and OSD. It should be suitable for
>journal, giving about 16.1KIOPS (67MB/s) for sync IO.
>
>(measured using fio --filename=/dev/xxx --direct=1 --sync=1 --rw=write
>--bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based
>--group_reporting --name=journal-test)
>
>On top of this cluster, I have a KVM guest running (using the qemu librbd
>backend).
>Overall performance seems to be quite good, but the problem is when I try
>to measure sync IO performance inside the guest.. I'm getting only about
>600 IOPS, which I think is quite poor.
>
>The problem is, I don't see any bottleneck: the OSD daemons don't seem to be
>hanging on IO or hogging CPU, and the qemu process isn't particularly
>loaded either..
>
>I'm using hammer 0.94.5 on top of centos 6 (4.1 kernel), all debugging
>disabled,
>
>my question is, what results can I expect for synchronous writes? I
>understand there will always be some performance drop, but 600 IOPS on top
>of storage which can give as much as 16K IOPS seems too little..
>
>Has anyone done similar measuring?
>
>thanks a lot in advance!
>
>BR
>
>nik
>
>
>-- 
>-
>Ing. Nikola CIPRICH
>LinuxBox.cz, s.r.o.
>28.rijna 168, 709 00 Ostrava
>
>tel.:   +420 591 166 214
>fax:    +420 596 621 273
>mobil:  +420 777 093 799
>www.linuxbox.cz
>
>mobil servis: +420 737 238 656
>email servis: ser...@linuxbox.cz
>-



Re: [ceph-users] sync writes - expected performance?

2015-12-14 Thread Mark Nelson



On 12/14/2015 04:49 AM, Nikola Ciprich wrote:

> Hello,
>
> I'm doing some measuring on a test (3-node) cluster and see a strange performance
> drop for sync writes..
>
> I'm using SSD for both journalling and OSD. It should be suitable for
> journal, giving about 16.1KIOPS (67MB/s) for sync IO.
>
> (measured using fio --filename=/dev/xxx --direct=1 --sync=1 --rw=write --bs=4k
> --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting
> --name=journal-test)
>
> On top of this cluster, I have a KVM guest running (using the qemu librbd backend).
> Overall performance seems to be quite good, but the problem is when I try
> to measure sync IO performance inside the guest.. I'm getting only about
> 600 IOPS, which I think is quite poor.
>
> The problem is, I don't see any bottleneck: the OSD daemons don't seem to be
> hanging on IO or hogging CPU, and the qemu process isn't particularly loaded either..
>
> I'm using hammer 0.94.5 on top of centos 6 (4.1 kernel), all debugging disabled,
>
> my question is, what results can I expect for synchronous writes? I understand
> there will always be some performance drop, but 600 IOPS on top of storage which
> can give as much as 16K IOPS seems too little..


So basically what this comes down to is latency.  Since you get 16K IOPS 
for O_DSYNC writes on the SSD, there's a good chance that it has a 
super-capacitor on board and can basically acknowledge a write as 
complete as soon as it hits the on-board cache rather than when it's 
written to flash.  Figure that 16K O_DSYNC IOPS means each IO
is completing in around 0.06ms on average.  That's very fast!  At 600
IOPS for O_DSYNC writes on your guest, you're looking at about 1.6ms per
IO on average.
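
(The arithmetic behind those numbers, as a quick sketch -- at queue depth 1 the
average per-IO latency is just the inverse of the IOPS figure:)

# per-IO latency implied by an IOPS number at queue depth 1
for label, iops in (("raw SSD, O_DSYNC", 16100), ("guest via librbd", 600)):
    print("%s: %d IOPS -> %.3f ms per IO" % (label, iops, 1000.0 / iops))
# -> 16100 IOPS is about 0.062 ms per IO; 600 IOPS is about 1.667 ms per IO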


So how do we account for the difference?  Let's start out by looking at 
a quick example of network latency (This is between two random machines 
in one of our labs at Red Hat):



64 bytes from gqas008: icmp_seq=1 ttl=64 time=0.583 ms
64 bytes from gqas008: icmp_seq=2 ttl=64 time=0.219 ms
64 bytes from gqas008: icmp_seq=3 ttl=64 time=0.224 ms
64 bytes from gqas008: icmp_seq=4 ttl=64 time=0.200 ms
64 bytes from gqas008: icmp_seq=5 ttl=64 time=0.196 ms


now consider that when you do a write in ceph, you write to the primary 
OSD which then writes out to the replica OSDs.  Every replica IO has to 
complete before the primary will send the acknowledgment to the client 
(ie you have to add the latency of the worst of the replica writes!). 
In your case, the network latency alone is likely dramatically 
increasing IO latency vs raw SSD O_DSYNC writes.  Now add in the time to 
process crush mappings, look up directory and inode metadata on the 
filesystem where objects are stored (assuming it's not cached), and 
other processing time, and the 1.6ms latency for the guest writes starts 
to make sense.
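
(Putting rough numbers on that write path -- a back-of-the-envelope sketch only,
where every component below is an illustrative guess of the same order as the
ping times above, not a measurement:)

# one replicated 4k write at queue depth 1, size=3 (two replica writes)
client_to_primary_rtt = 0.25    # ms, client <-> primary OSD network round trip
replica_rtts = [0.25, 0.30]     # ms, primary <-> each replica
osd_processing = 0.40           # ms, crush mapping, fs metadata, queuing per OSD
ssd_commit = 0.06               # ms, the raw O_DSYNC commit time measured above

latency = (client_to_primary_rtt
           + max(replica_rtts)      # primary waits for the slowest replica
           + 2 * osd_processing     # primary plus that slowest replica
           + 2 * ssd_commit)        # journal commit on primary and that replica
print("modelled per-IO latency: %.2f ms" % latency)  # ~1.5 ms, same ballpark as 1.6 ms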


Can we improve things?  Likely yes.  There are various areas in the code
where we can trim latency away, implement alternate OSD backends, and 
potentially use alternate network technology like RDMA to reduce network 
latency.  The thing to remember is that when you are talking about 
O_DSYNC writes, even very small increases in latency can have dramatic 
effects on performance.  Every fraction of a millisecond has huge 
ramifications.




> Has anyone done similar measuring?
>
> thanks a lot in advance!
>
> BR
>
> nik






Re: [ceph-users] sync writes - expected performance?

2015-12-14 Thread Warren Wang - ISD
Whoops, I misread Nikola's original email, sorry!

If all your SSDs are performing at that level for sync IO, then I
agree that it's down to other things, like network latency and PG locking.
Sequential 4K writes with 1 thread and 1 qd is probably the worst
performance you'll see. Is there a router between your VM and the Ceph
cluster, or one between Ceph nodes for the cluster network?

Are you using dsync at the VM level to simulate what a database or other
app would do? If you can switch to directIO, you'll likely get far better
performance.
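
(The difference in code terms -- a minimal sketch; /dev/xxx is a placeholder,
O_DIRECT needs a page-aligned buffer (mmap provides one), and note that O_DIRECT
on its own only bypasses the page cache, it does not give the durability
guarantee O_DSYNC does -- which is largely why it looks so much faster:)

import mmap, os

# O_DSYNC: write() returns only after the device reports the data durable
fd_sync = os.open("/dev/xxx", os.O_WRONLY | os.O_DSYNC)
os.pwrite(fd_sync, b"\0" * 4096, 0)

# O_DIRECT: bypasses the page cache; buffer, offset and length must be aligned,
# but write() may return before the data is actually safe on media
buf = mmap.mmap(-1, 4096)            # anonymous mapping => page-aligned buffer
buf.write(b"\0" * 4096)
fd_direct = os.open("/dev/xxx", os.O_WRONLY | os.O_DIRECT)
os.pwrite(fd_direct, buf, 0)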

Warren Wang




On 12/14/15, 12:03 PM, "Mark Nelson"  wrote:

>
>
>On 12/14/2015 04:49 AM, Nikola Ciprich wrote:
>> Hello,
>>
>> I'm doing some measuring on a test (3-node) cluster and see a strange
>> performance drop for sync writes..
>>
>> I'm using SSD for both journalling and OSD. It should be suitable for
>> journal, giving about 16.1KIOPS (67MB/s) for sync IO.
>>
>> (measured using fio --filename=/dev/xxx --direct=1 --sync=1 --rw=write
>>--bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based
>>--group_reporting --name=journal-test)
>>
>> On top of this cluster, I have a KVM guest running (using the qemu librbd
>> backend).
>> Overall performance seems to be quite good, but the problem is when I
>> try to measure sync IO performance inside the guest.. I'm getting only
>> about 600 IOPS, which I think is quite poor.
>>
>> The problem is, I don't see any bottleneck: the OSD daemons don't seem to
>> be hanging on IO or hogging CPU, and the qemu process isn't particularly
>> loaded either..
>>
>> I'm using hammer 0.94.5 on top of centos 6 (4.1 kernel), all debugging
>>disabled,
>>
>> my question is, what results can I expect for synchronous writes? I
>> understand there will always be some performance drop, but 600 IOPS on top
>> of storage which can give as much as 16K IOPS seems too little..
>
>So basically what this comes down to is latency.  Since you get 16K IOPS
>for O_DSYNC writes on the SSD, there's a good chance that it has a
>super-capacitor on board and can basically acknowledge a write as
>complete as soon as it hits the on-board cache rather than when it's
>written to flash.  Figure that 16K O_DSYNC IOPS means each IO
>is completing in around 0.06ms on average.  That's very fast!  At 600
>IOPS for O_DSYNC writes on your guest, you're looking at about 1.6ms per
>IO on average.
>
>So how do we account for the difference?  Let's start out by looking at
>a quick example of network latency (This is between two random machines
>in one of our labs at Red Hat):
>
>> 64 bytes from gqas008: icmp_seq=1 ttl=64 time=0.583 ms
>> 64 bytes from gqas008: icmp_seq=2 ttl=64 time=0.219 ms
>> 64 bytes from gqas008: icmp_seq=3 ttl=64 time=0.224 ms
>> 64 bytes from gqas008: icmp_seq=4 ttl=64 time=0.200 ms
>> 64 bytes from gqas008: icmp_seq=5 ttl=64 time=0.196 ms
>
>now consider that when you do a write in ceph, you write to the primary
>OSD which then writes out to the replica OSDs.  Every replica IO has to
>complete before the primary will send the acknowledgment to the client
>(ie you have to add the latency of the worst of the replica writes!).
>In your case, the network latency alone is likely dramatically
>increasing IO latency vs raw SSD O_DSYNC writes.  Now add in the time to
>process crush mappings, look up directory and inode metadata on the
>filesystem where objects are stored (assuming it's not cached), and
>other processing time, and the 1.6ms latency for the guest writes starts
>to make sense.
>
>Can we improve things?  Likely yes.  There are various areas in the code
>where we can trim latency away, implement alternate OSD backends, and
>potentially use alternate network technology like RDMA to reduce network
>latency.  The thing to remember is that when you are talking about
>O_DSYNC writes, even very small increases in latency can have dramatic
>effects on performance.  Every fraction of a millisecond has huge
>ramifications.
>
>>
>> Has anyone done similar measuring?
>>
>> thanks a lot in advance!
>>
>> BR
>>
>> nik
>>
>>
>>
>>



Re: [ceph-users] sync writes - expected performance?

2015-12-14 Thread Jan Schermer
Even with 10G ethernet, the bottleneck is not the network, nor the drives 
(assuming they are datacenter-class). The bottleneck is the software.
The only way to improve that is to either increase CPU speed (more GHz per 
core) or to simplify the datapath IO has to take before it is considered 
durable.
Stuff like RDMA will help only if there is zero-copy between the (RBD) client
and the drive, or if the write is acknowledged once it is in the remote buffers of
replicas (but it still has to come from the client directly, or RDMA becomes a bit
pointless, IMHO).

Databases do sync writes for a reason; O_DIRECT doesn't actually make strong
guarantees on ordering or buffering, though in practice the race condition is 
negligible.

Your 600 IOPS are pretty good actually.

Jan


> On 14 Dec 2015, at 22:58, Warren Wang - ISD  wrote:
> 
> Whoops, I misread Nikola's original email, sorry!
> 
> If all your SSDs are performing at that level for sync IO, then I
> agree that it's down to other things, like network latency and PG locking.
> Sequential 4K writes with 1 thread and 1 qd is probably the worst
> performance you'll see. Is there a router between your VM and the Ceph
> cluster, or one between Ceph nodes for the cluster network?
> 
> Are you using dsync at the VM level to simulate what a database or other
> app would do? If you can switch to directIO, you'll likely get far better
> performance.
> 
> Warren Wang
> 
> 
> 
> 
> On 12/14/15, 12:03 PM, "Mark Nelson"  wrote:
> 
>> 
>> 
>> On 12/14/2015 04:49 AM, Nikola Ciprich wrote:
>>> Hello,
>>> 
>>> I'm doing some measuring on a test (3-node) cluster and see a strange
>>> performance drop for sync writes..
>>> 
>>> I'm using SSD for both journalling and OSD. It should be suitable for
>>> journal, giving about 16.1KIOPS (67MB/s) for sync IO.
>>> 
>>> (measured using fio --filename=/dev/xxx --direct=1 --sync=1 --rw=write
>>> --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based
>>> --group_reporting --name=journal-test)
>>> 
>>> On top of this cluster, I have a KVM guest running (using the qemu librbd
>>> backend).
>>> Overall performance seems to be quite good, but the problem is when I
>>> try to measure sync IO performance inside the guest.. I'm getting only
>>> about 600 IOPS, which I think is quite poor.
>>> 
>>> The problem is, I don't see any bottleneck: the OSD daemons don't seem to
>>> be hanging on IO or hogging CPU, and the qemu process isn't particularly
>>> loaded either..
>>> 
>>> I'm using hammer 0.94.5 on top of centos 6 (4.1 kernel), all debugging
>>> disabled,
>>> 
>>> my question is, what results can I expect for synchronous writes? I
>>> understand there will always be some performance drop, but 600 IOPS on
>>> top of storage which can give as much as 16K IOPS seems too little..
>> 
>> So basically what this comes down to is latency.  Since you get 16K IOPS
>> for O_DSYNC writes on the SSD, there's a good chance that it has a
>> super-capacitor on board and can basically acknowledge a write as
>> complete as soon as it hits the on-board cache rather than when it's
>> written to flash.  Figure that 16K O_DSYNC IOPS means each IO
>> is completing in around 0.06ms on average.  That's very fast!  At 600
>> IOPS for O_DSYNC writes on your guest, you're looking at about 1.6ms per
>> IO on average.
>> 
>> So how do we account for the difference?  Let's start out by looking at
>> a quick example of network latency (This is between two random machines
>> in one of our labs at Red Hat):
>> 
>>> 64 bytes from gqas008: icmp_seq=1 ttl=64 time=0.583 ms
>>> 64 bytes from gqas008: icmp_seq=2 ttl=64 time=0.219 ms
>>> 64 bytes from gqas008: icmp_seq=3 ttl=64 time=0.224 ms
>>> 64 bytes from gqas008: icmp_seq=4 ttl=64 time=0.200 ms
>>> 64 bytes from gqas008: icmp_seq=5 ttl=64 time=0.196 ms
>> 
>> now consider that when you do a write in ceph, you write to the primary
>> OSD which then writes out to the replica OSDs.  Every replica IO has to
>> complete before the primary will send the acknowledgment to the client
>> (ie you have to add the latency of the worst of the replica writes!).
>> In your case, the network latency alone is likely dramatically
>> increasing IO latency vs raw SSD O_DSYNC writes.  Now add in the time to
>> process crush mappings, look up directory and inode metadata on the
>> filesystem where objects are stored (assuming it's not cached), and
>> other processing time, and the 1.6ms latency for the guest writes starts
>> to make sense.
>> 
>> Can we improve things?  Likely yes.  There are various areas in the code
>> where we can trim latency away, implement alternate OSD backends, and
>> potentially use alternate network technology like RDMA to reduce network
>> latency.  The thing to remember is that when you are talking about
>> O_DSYNC writes, even very small increases in latency can have dramatic
>> effects on performance.  Every fraction of a millisecond has huge
>> ramifications.

Re: [ceph-users] sync writes - expected performance?

2015-12-14 Thread Warren Wang - ISD
I get where you are coming from, Jan, but for a test this small, I still
think checking network latency first for a single op is a good idea.

Given that the cluster is not being stressed, CPUs may be running slow. It
may also benefit the test to turn CPU governors to performance for all
cores.
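
(A quick way to check what the governors are currently set to -- a sketch using
the standard Linux sysfs paths; actually switching them to performance needs
root and is distro-specific, e.g. via cpupower or tuned:)

import glob

# print the current scaling governor for every core
for path in sorted(glob.glob("/sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_governor")):
    with open(path) as f:
        print(path.split("/")[5], f.read().strip())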

Warren Wang




On 12/14/15, 5:07 PM, "Jan Schermer"  wrote:

>Even with 10G ethernet, the bottleneck is not the network, nor the drives
>(assuming they are datacenter-class). The bottleneck is the software.
>The only way to improve that is to either increase CPU speed (more GHz
>per core) or to simplify the datapath IO has to take before it is
>considered durable.
>Stuff like RDMA will help only if there is zero-copy between the (RBD)
>client and the drive, or if the write is acknowledged once it is in the remote
>buffers of replicas (but it still has to come from the client directly, or
>RDMA becomes a bit pointless, IMHO).
>
>Databases do sync writes for a reason; O_DIRECT doesn't actually make
>strong guarantees on ordering or buffering, though in practice the race
>condition is negligible.
>
>Your 600 IOPS are pretty good actually.
>
>Jan
>
>
>> On 14 Dec 2015, at 22:58, Warren Wang - ISD 
>>wrote:
>> 
>> Whoops, I misread Nikola's original email, sorry!
>> 
>> If all your SSDs are performing at that level for sync IO, then I
>> agree that it's down to other things, like network latency and PG
>> locking.
>> Sequential 4K writes with 1 thread and 1 qd is probably the worst
>> performance you'll see. Is there a router between your VM and the Ceph
>> cluster, or one between Ceph nodes for the cluster network?
>> 
>> Are you using dsync at the VM level to simulate what a database or other
>> app would do? If you can switch to directIO, you'll likely get far
>> better performance.
>> 
>> Warren Wang
>> 
>> 
>> 
>> 
>> On 12/14/15, 12:03 PM, "Mark Nelson"  wrote:
>> 
>>> 
>>> 
>>> On 12/14/2015 04:49 AM, Nikola Ciprich wrote:
>>>> Hello,
>>>>
>>>> I'm doing some measuring on a test (3-node) cluster and see a strange
>>>> performance drop for sync writes..
>>>>
>>>> I'm using SSD for both journalling and OSD. It should be suitable for
>>>> journal, giving about 16.1KIOPS (67MB/s) for sync IO.
>>>>
>>>> (measured using fio --filename=/dev/xxx --direct=1 --sync=1 --rw=write
>>>> --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based
>>>> --group_reporting --name=journal-test)
>>>>
>>>> On top of this cluster, I have a KVM guest running (using the qemu librbd
>>>> backend).
>>>> Overall performance seems to be quite good, but the problem is when I
>>>> try to measure sync IO performance inside the guest.. I'm getting only
>>>> about 600 IOPS, which I think is quite poor.
>>>>
>>>> The problem is, I don't see any bottleneck: the OSD daemons don't seem to
>>>> be hanging on IO or hogging CPU, and the qemu process isn't particularly
>>>> loaded either..
>>>>
>>>> I'm using hammer 0.94.5 on top of centos 6 (4.1 kernel), all debugging
>>>> disabled,
>>>>
>>>> my question is, what results can I expect for synchronous writes? I
>>>> understand there will always be some performance drop, but 600 IOPS on
>>>> top of storage which can give as much as 16K IOPS seems too little..
>>> 
>>> So basically what this comes down to is latency.  Since you get 16K
>>>IOPS
>>> for O_DSYNC writes on the SSD, there's a good chance that it has a
>>> super-capacitor on board and can basically acknowledge a write as
>>> complete as soon as it hits the on-board cache rather than when it's
>>> written to flash.  Figure that 16K O_DSYNC IOPS means each IO
>>> is completing in around 0.06ms on average.  That's very fast!  At 600
>>> IOPS for O_DSYNC writes on your guest, you're looking at about 1.6ms per
>>> IO on average.
>>> 
>>> So how do we account for the difference?  Let's start out by looking at
>>> a quick example of network latency (This is between two random machines
>>> in one of our labs at Red Hat):
>>> 
 64 bytes from gqas008: icmp_seq=1 ttl=64 time=0.583 ms
 64 bytes from gqas008: icmp_seq=2 ttl=64 time=0.219 ms
 64 bytes from gqas008: icmp_seq=3 ttl=64 time=0.224 ms
 64 bytes from gqas008: icmp_seq=4 ttl=64 time=0.200 ms
 64 bytes from gqas008: icmp_seq=5 ttl=64 time=0.196 ms
>>> 
>>> now consider that when you do a write in ceph, you write to the primary
>>> OSD which then writes out to the replica OSDs.  Every replica IO has to
>>> complete before the primary will send the acknowledgment to the client
>>> (ie you have to add the latency of the worst of the replica writes!).
>>> In your case, the network latency alone is likely dramatically
>>> increasing IO latency vs raw SSD O_DSYNC writes.  Now add in the time
>>>to
>>> process crush mappings, look up directory and inode metadata on the
>>> filesystem where objects are stored (assuming it's not cached), and
>>> other processing time, and the 1.6ms latency for the