Hi Nick,
On 5/1/2018 11:50 PM, Nick Fisk wrote:
Hi all,
Slowly getting round to migrating clusters to Bluestore but I am
interested in how people are handling the potential change in write
latency coming from Filestore? Or maybe nobody is really seeing much
difference?
As we all know, in Bluestore, writes are not double written and in
most cases go straight to disk. Whilst this is awesome for people with
pure SSD or pure HDD clusters as the amount of overhead is drastically
reduced, for people with HDD+SSD journals in Filestore land, the
double write had the side effect of acting like a battery backed
cache, accelerating writes when not under saturation.
In some brief testing I am seeing Filestore OSDs with an NVMe journal
show an average apply latency of around 1-2ms, whereas some new
Bluestore OSDs in the same cluster are showing 20-40ms. I am fairly
certain this is because writes now exhibit the latency of the underlying
7.2k disk. Note: the cluster is very lightly loaded; nothing is
being driven into saturation.
I know there is a deferred write tuning knob which adjusts the cutover
for when an object is double written, but at the default of 32KB, I
suspect a lot of IOs even in the 1MB range are still drastically
slower going straight to disk than if double written to NVMe first.
Has anybody else done any investigation in this area? Is there any
long-term harm in running a cluster that defers writes up to 1MB+ in
size to mimic the Filestore double-write approach?
This should work fine under low load, but be careful when load rises:
RocksDB and the machinery around it might become a bottleneck in
this scenario.
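For anyone wanting to experiment with that cutover, here is a minimal
ceph.conf sketch. The option name and default below match what I see in
Luminous-era documentation (bluestore_prefer_deferred_size_hdd, default
32768); please verify against your own release before applying:

```ini
[osd]
# Defer (double-write via the WAL) any write up to 1 MiB on HDD-backed
# OSDs, mimicking the Filestore journal behaviour discussed above.
# Default is 32768 (32 KiB); raising it shifts more writes through
# RocksDB/WAL, so watch DB device load.
bluestore_prefer_deferred_size_hdd = 1048576
```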
I also suspect after looking through github that deferred writes only
happen when overwriting an existing object or blob (not sure which
case applies), so new allocations are still written straight to disk.
Can anyone confirm?
"small" writes (length < min_alloc_size) are direct if they go to unused
chunk (4K or more depending on checksum settings) of an existing mutable
block and write length > bluestore_prefer_deferred_size only.
E.g. appending with 4K data blocks to an object at HDD will trigger
deferred mode for the first of every 16 writes (given that default
min_alloc_size for HDD is 64K). Rest 15 go direct.
"big" writes are unconditionally deferred if length <=
bluestore_prefer_deferred_size.
PS. If your spinning disks sit behind a RAID controller with BBWC,
then you are not affected by this.
Thanks,
Nick
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com