On 07/10/2014 03:24 AM, Xabier Elkano wrote:
On 10/07/14 09:18, Christian Balzer wrote:
On Thu, 10 Jul 2014 08:57:56 +0200 Xabier Elkano wrote:

On 09/07/14 16:53, Christian Balzer wrote:
On Wed, 09 Jul 2014 07:07:50 -0500 Mark Nelson wrote:

On 07/09/2014 06:52 AM, Xabier Elkano wrote:
On 09/07/14 13:10, Mark Nelson wrote:
On 07/09/2014 05:57 AM, Xabier Elkano wrote:
Hi,

I was doing some tests in my cluster with the fio tool: one fio
instance with 70 jobs, each job writing 1GB of random data with a 4K
block size. I ran this test with 3 variations:

1- Creating 70 images, 60GB each, in the pool. Using the rbd kernel
module, format and mount each image as ext4. Each fio job writes to a
separate image/directory. (ioengine=libaio, queue_depth=4,
direct=1)

      IOPS: 6542
      AVG LAT: 41ms

2- Creating 1 large image of 4.2TB in the pool. Using the rbd kernel
module, format and mount the image as ext4. Each fio job writes to a
separate file in the same directory. (ioengine=libaio,
queue_depth=4, direct=1)

     IOPS: 5899
     AVG LAT:  47ms

3- Creating 1 large image of 4.2TB in the pool. Using the rbd ioengine
in fio to access the image through librados. (ioengine=rbd,
queue_depth=4, direct=1) See the job file sketch after the results below.

     IOPS: 2638
     AVG LAT: 96ms
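
For reference, a job file along these lines drives the ioengine=rbd
case; this is only a sketch, and the client name, pool name, image
name and runtime limit are assumptions, not the exact values from my
cluster:

    [global]
    ioengine=rbd          ; go through librbd/librados instead of a block device
    clientname=admin      ; cephx user (assumed)
    pool=rbd              ; pool holding the test image (assumed)
    rbdname=vtest-big     ; the single 4.2TB image (assumed name)
    rw=randwrite
    bs=4k
    direct=1
    iodepth=4
    runtime=100           ; the runs below lasted ~100s (assumed limit)
    numjobs=70
    size=1g               ; each job writes 1GB

The libaio cases use the same job options but with ioengine=libaio,
pointing each job at a file under the mounted filesystem via
directory= or filename=.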

Do these results make sense? From a Ceph perspective, is it better to
have many small images than one large image? What is the best approach
to simulate the workload of 70 VMs?
I'm not sure the difference between the first two cases is enough to
say much yet.  You may need to repeat the test a couple of times to
ensure that the difference is more than noise.  Having said that, if
we are seeing an effect, it would be interesting to know what the
latency distribution is like.  Is it consistently worse in the 2nd
case, or do we see higher spikes at specific times?

I've repeated the tests with similar results. Each test is done with
a clean new rbd image, first removing any existing images in the
pool and then creating the new image. Between tests I am running:

   echo 3 > /proc/sys/vm/drop_caches

- In the first test I created 70 images (60G each) and mounted them
(see the setup sketch after the mount list):

/dev/rbd1 on /mnt/fiotest/vtest0
/dev/rbd2 on /mnt/fiotest/vtest1
..
/dev/rbd70 on /mnt/fiotest/vtest69
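
The images were created, mapped, formatted and mounted with roughly
the following loop; this is a sketch, and the exact image names,
device numbering and pool may have differed:

    for i in $(seq 0 69); do
        rbd create --size 61440 vtest$i          # 60G image in the default pool
        rbd map vtest$i                          # appears as /dev/rbd$((i+1)) here
        mkfs.ext4 -q /dev/rbd$((i+1))
        mkdir -p /mnt/fiotest/vtest$i
        mount -o noatime,nodiratime /dev/rbd$((i+1)) /mnt/fiotest/vtest$i
    done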

fio output:

rand-write-4k: (groupid=0, jobs=70): err= 0: pid=21852: Tue Jul  8 14:52:56 2014
    write: io=2559.5MB, bw=26179KB/s, iops=6542, runt=100116msec
      slat (usec): min=18, max=512646, avg=4002.62, stdev=13754.33
      clat (usec): min=867, max=579715, avg=37581.64, stdev=55954.19
       lat (usec): min=903, max=586022, avg=41957.74, stdev=59276.40
      clat percentiles (msec):
       |  1.00th=[    5],  5.00th=[   10], 10.00th=[   13], 20.00th=[   18],
       | 30.00th=[   21], 40.00th=[   26], 50.00th=[   31], 60.00th=[   34],
       | 70.00th=[   37], 80.00th=[   41], 90.00th=[   48], 95.00th=[   61],
       | 99.00th=[  404], 99.50th=[  445], 99.90th=[  494], 99.95th=[  515],
       | 99.99th=[  553]
      bw (KB  /s): min=    0, max=  694, per=1.46%, avg=383.29, stdev=148.01
      lat (usec) : 1000=0.01%
      lat (msec) : 2=0.12%, 4=0.63%, 10=4.82%, 20=22.33%, 50=63.97%
      lat (msec) : 100=5.61%, 250=0.47%, 500=2.01%, 750=0.08%
    cpu          : usr=0.69%, sys=2.57%, ctx=1525021, majf=0, minf=2405
    IO depths    : 1=1.1%, 2=0.6%, 4=335.8%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
       submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
       complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
       issued    : total=r=0/w=655015/d=0, short=r=0/w=0/d=0
       latency   : target=0, window=0, percentile=100.00%, depth=4

Run status group 0 (all jobs):
    WRITE: io=2559.5MB, aggrb=26178KB/s, minb=26178KB/s, maxb=26178KB/s, mint=100116msec, maxt=100116msec

Disk stats (read/write):
    rbd1: ios=0/2408612, merge=0/979004, ticks=0/39436432, in_queue=39459720, util=99.68%

- In the second test I created only one large image (4.2T):

/dev/rbd1 on /mnt/fiotest/vtest0 type ext4
(rw,noatime,nodiratime,data=ordered)

fio output:

rand-write-4k: (groupid=0, jobs=70): err= 0: pid=8907: Wed Jul  9 13:38:14 2014
    write: io=2264.6MB, bw=23143KB/s, iops=5783, runt=100198msec
      slat (usec): min=0, max=3099.8K, avg=4131.91, stdev=21388.98
      clat (usec): min=850, max=3133.1K, avg=43337.56, stdev=93830.42
       lat (usec): min=930, max=3147.5K, avg=48253.22, stdev=100642.53
      clat percentiles (msec):
       |  1.00th=[    5],  5.00th=[   11], 10.00th=[   14], 20.00th=[   19],
       | 30.00th=[   24], 40.00th=[   29], 50.00th=[   33], 60.00th=[   36],
       | 70.00th=[   39], 80.00th=[   43], 90.00th=[   51], 95.00th=[   68],
       | 99.00th=[  506], 99.50th=[  553], 99.90th=[  717], 99.95th=[  783],
       | 99.99th=[ 3130]
      bw (KB  /s): min=    0, max=  680, per=1.54%, avg=355.39, stdev=156.10
      lat (usec) : 1000=0.01%
      lat (msec) : 2=0.12%, 4=0.66%, 10=4.21%, 20=17.82%, 50=66.95%
      lat (msec) : 100=7.34%, 250=0.78%, 500=1.10%, 750=0.99%, 1000=0.02%
      lat (msec) : >=2000=0.04%
    cpu          : usr=0.65%, sys=2.45%, ctx=1434322, majf=0, minf=2399
    IO depths    : 1=0.2%, 2=0.1%, 4=365.4%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
       submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
       complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
       issued    : total=r=0/w=579510/d=0, short=r=0/w=0/d=0
       latency   : target=0, window=0, percentile=100.00%, depth=4

Run status group 0 (all jobs):
    WRITE: io=2264.6MB, aggrb=23142KB/s, minb=23142KB/s, maxb=23142KB/s, mint=100198msec, maxt=100198msec

Disk stats (read/write):
    rbd1: ios=0/2295106, merge=0/926648, ticks=0/39660664, in_queue=39706288, util=99.80%



It seems that latency is more stable in the first case.
So I guess what comes to mind is that when you have all of the fio
processes writing to files on a single file system, there's now a
whole other layer of locks and contention.  Not sure how likely
this is, though.  Josh might be able to chime in if there's something
on the RBD side that could slow this kind of use case down.

In case 3, do you have multiple fio jobs going or just 1?
In all three cases, I am using one fio process with NUMJOBS=70
Is RBD cache enabled?  It's interesting that librbd is so much slower
in this case than kernel RBD for you.  If anything I would have
expected the opposite.

Come again?
User space RBD with the default values will have little to no impact in
this scenario.

Whereas kernel space RBD will be able to use every last byte of memory
for page cache, totally ousting user space RBD.

Regards,

Christian
Hi Christian!

I am using "direct=1" with fio in all tests, this should not bypass the
page cache?

It should and will do that inside the VM, but the RBD cache is outside of
that.
In the case of kernel space RBD with writeback caching enabled on the VM
(KVM/qemu), the page cache of the HOST is being used for RBD caching,
something you should be able to see easily when looking at your memory
usage (buffers) while testing with large datasets.
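
A quick way to watch this on the host while a benchmark runs (just a
sketch using standard tools, nothing Ceph-specific):

    # watch the host page cache grow as the kernel writes back
    watch -n1 'grep -E "^(Buffers|Cached|Dirty)" /proc/meminfo'
    # or simply
    free -m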

Christian
I am using the kernel rbd module inside a KVM VM, but the rbd device is
mapped and mounted inside the VM, so the HOST is not aware of the IOPS
generated by the VM: the VM talks directly to the OSDs, and the only
page cache that could be used is the VM's own, which direct=1 should bypass.

I think you assumed that I was running the test against a VM disk backed
by an rbd device on the HOST, but that is not the case. That is why I
don't understand these differences between kernel rbd and librados with
fio.

The write path and code involved are different, especially if you are using the RBD cache. You might try disabling it just to see what happens.

I also wonder if you might want to test the same filesystem mounted on a volume attached via qemu/kvm with librbd. Perhaps there is something else going on that we don't understand yet.
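
If you want to rule the librbd cache in or out, something like this
sketch would do it (the pool, image and user names here are
placeholders, not your actual ones):

    # ceph.conf on the client: disable the librbd cache globally
    [client]
        rbd cache = false

    # or attach the image to a VM through librbd and control caching per drive
    qemu-system-x86_64 ... \
        -drive file=rbd:rbd/vtest-big:id=admin,format=raw,if=virtio,cache=none

As far as I know, with recent qemu the drive cache mode overrides the
rbd cache setting for that drive.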


BR,
Xabier


Best Regards,
Xabier

thanks in advance for any help,
Xabier