> -----Original Message-----
> From: ceph-users [mailto:[email protected]] On Behalf Of 
> [email protected]
> Sent: 18 August 2016 09:35
> To: nick <[email protected]>
> Cc: ceph-users <[email protected]>
> Subject: Re: [ceph-users] Ceph all NVME Cluster sequential read speed
> 
> 
> 
> > On 18 Aug 2016, at 10:15, nick <[email protected]> wrote:
> >
> > Hi,
> > we are currently building a new ceph cluster with only NVME devices.
> > One Node consists of 4x Intel P3600 2TB devices. Journal and filestore
> > are on the same device. Each server has a 10 core CPU and uses 10 GBit
> > ethernet NICs for public and ceph storage traffic. We are currently testing 
> > with 4 nodes overall.
> >
> > The cluster will be used only for virtual machine images via RBD. The
> > pools are replicated (no EC).
> >
> > Although we are pretty happy with the single-threaded write
> > performance, the single-threaded (iodepth=1) sequential read
> > performance is a bit disappointing.
> >
> > We are testing with fio and the rbd engine. After creating a 10GB RBD
> > image, we use the following fio params to test:
> > """
> > [global]
> > invalidate=1
> > ioengine=rbd
> > iodepth=1
> > ramp_time=2
> > size=2G
> > bs=4k
> > direct=1
> > buffered=0
> > """
> >
> > For a 4k workload we are reaching 1382 IOPS. Testing one NVME device
> > directly (with psync engine and iodepth of 1) we can reach up to 84176
> > IOPS. This is a big difference.
> >
> 
> The network makes a big difference as well. Keep in mind that the Ceph OSDs 
> also have to process each I/O.
> 
> For example, if you have a network latency of 0.200 ms, then in 1,000 ms (1 s) 
> you will potentially be able to do 5,000 IOPS, and that is without the OSD or 
> any other layers doing any work.
> 
> 
> > I already read that the read_ahead setting might improve the
> > situation, although this would only be true when using buffered reads, 
> > right?
> >
> > Does anyone have other suggestions to get better serial read performance?
> >
> 
> You might want to disable all logging and look at AsyncMessenger. Disabling 
> cephx might help, but that is not very safe to do.

Just to add to what Wido has mentioned: the problem is serialised latency. The 
combined effect of the network and the Ceph code path means that each IO request 
has to travel much further than it would over a local SATA cable.

The trick is to remove as much of this latency as possible wherever you can. Wido 
has already mentioned one good option: turning off logging.
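As a rough sketch (the exact set of debug subsystems worth silencing varies a bit 
between releases, so treat this as a starting point rather than a definitive list), 
that means something like this in ceph.conf on the OSD nodes and clients:

"""
[global]
# turn off in-memory and file debug logging for the chattiest subsystems
debug ms = 0/0
debug osd = 0/0
debug filestore = 0/0
debug journal = 0/0
debug auth = 0/0
debug monc = 0/0
"""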
One thing I have found which helps massively is to force the CPU C-state to 1 and 
pin the CPUs at their maximum frequency. Otherwise the CPUs can spend up to 200us 
waking up from deep sleep several times per IO. Doing this I managed to get my 4kB 
write latency on a 3x replica pool down to 600us!

So stick this on your kernel boot line 

intel_idle.max_cstate=1
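For example, assuming a GRUB2-based Debian/Ubuntu setup (the existing contents of 
GRUB_CMDLINE_LINUX_DEFAULT will differ on your machines; "quiet" below is just a 
placeholder), append it in /etc/default/grub:

"""
# /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_idle.max_cstate=1"
"""

then run update-grub (or grub2-mkconfig -o /boot/grub2/grub.cfg on RHEL/CentOS) 
and reboot.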

and stick this somewhere like your rc.local

echo 100 > /sys/devices/system/cpu/intel_pstate/min_perf_pct

Although there may be some gains in setting it to 90-95% instead, so that when 
only one core is active it can turbo slightly higher.
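To sanity-check that the pinning actually took effect (these paths assume the 
intel_pstate driver is in use):

cat /sys/devices/system/cpu/intel_pstate/min_perf_pct
grep MHz /proc/cpuinfo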

Also, since you are using the RBD engine in fio, you should be able to use 
readahead caching even with direct I/O. You just need to enable it in the 
ceph.conf on the client machine where you are running fio.
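Something along these lines in the [client] section of the ceph.conf that fio 
reads (the option names are the standard RBD readahead settings; the byte values 
here are just illustrative, tune them to your workload):

"""
[client]
rbd cache = true
# start readahead after 10 sequential reads, read ahead up to 4 MB
rbd readahead trigger requests = 10
rbd readahead max bytes = 4194304
# 0 = never switch readahead off after an initial amount has been read
rbd readahead disable after bytes = 0
"""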

Nick

> 
> Wido
> 
> > Cheers
> > Nick
> >
> > --
> > Sebastian Nickel
> > Nine Internet Solutions AG, Albisriederstr. 243a, CH-8047 Zuerich Tel
> > +41 44 637 40 00 | Support +41 44 637 40 40 | www.nine.ch

_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
