On 23/06/14 18:51, Christian Balzer wrote:
On Sunday, June 22, 2014, Mark Kirkwood <[email protected]> wrote:
rbd cache max dirty = 1073741824
rbd cache max dirty age = 100


Mark, you're giving it a 2GB cache.
For a write test that's 1GB in size.
"Aggressively set" is a bit of an understatement here. ^o^
Most people will not want to spend this much memory on write-only caching.
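For comparison, these settings normally live in the [client] section of ceph.conf, and the shipped defaults are far smaller (a 32 MB cache with 24 MB max dirty). A more conservative sketch - the values below are illustrative defaults, not tuning recommendations - would be:

```ini
[client]
rbd cache = true
rbd cache size = 33554432                  # 32 MB total cache (the default)
rbd cache max dirty = 25165824             # 24 MB dirty data before writeback starts (the default)
rbd cache max dirty age = 1.0              # flush dirty data after at most 1 second
rbd cache writethrough until flush = true  # stay in writethrough until the guest issues a flush
```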

Of course with these settings that test will yield impressive results.

However, if you observe your storage nodes (the OSDs), you will see that the
data still takes the same amount of time until it is actually, finally written
to disk. The same applies when using kernelspace RBD with caching enabled in
the VM. Doing similar tests with fio I managed to fill the cache and got
fantastic IOPS, but it then took minutes for the cache to finally drain.
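As a rough illustration of that kind of test - this is a hypothetical fio job file, and the target device and sizes are assumptions, not the exact job used - buffered random writes like these land in the cache without anything forcing a flush:

```ini
[rbd-cache-fill]
ioengine=libaio
rw=randwrite
bs=4k
size=1g
iodepth=32
direct=0               ; buffered I/O, so the writeback cache can absorb it all
filename=/dev/vdb      ; assumed RBD-backed virtio disk inside the guest
```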

Resulting in hung-task warnings for the jbd2 process(es) like this:
---
May 28 16:58:56 tvm-03 kernel: [  960.320182] INFO: task jbd2/vda1-8:153 blocked
  for more than 120 seconds.
May 28 16:58:56 tvm-03 kernel: [  960.320866] "echo 0 > 
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
---

Now this doesn't actively break things AFAICT, but it left me feeling
quite uncomfortable nevertheless.

Also, what happens if something "bad" happens to the VM or its host before
the cache is drained?

From where I'm standing, the RBD cache is fine for merging really small
writes, and that's it.

Yes! And thank you Christian for writing (something very similar to) what I was about to write in response to Greg's question!

For database types (and yes, I'm one of those) you want to know that your writes (particularly your commit writes) are actually making it to persistent storage (that ACID thing, you know). Now, I see the RBD cache much like a battery-backed RAID card: your commits (i.e. fsync or O_DIRECT writes) are not actually written but cached, so you are depending on the reliability of a) your RAID controller battery etc. in that case, or, more interestingly, b) your Ceph topology and its ability to withstand node failures. Given that we usually design a Ceph cluster with these things in mind, it is probably OK [1]!
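A minimal sketch of the durability contract databases rely on (Python, purely illustrative - the file and payload are made up): the commit may only be acknowledged after fsync() returns, because until then the data may still be sitting in a volatile cache such as the page cache or the RBD cache:

```python
import os
import tempfile

def durable_commit(payload: bytes) -> bytes:
    """Write payload and fsync before trusting that it is persistent."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, payload)        # data lands in a cache, not yet durable
        os.fsync(fd)                 # durability point: only now may we ack the commit
        os.lseek(fd, 0, os.SEEK_SET)
        return os.read(fd, len(payload))
    finally:
        os.close(fd)
        os.remove(path)

print(durable_commit(b"COMMIT;"))
```

With a writeback cache in the path, the guarantee after fsync() is only as strong as whatever that cache does with the flush request, which is exactly the dependency described above.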

Regards

Mark

[1] Obviously my setup in use here (2 OSDs, 2 SATA and 2 SSD drives, all on the same host) is merely a play/benchmark config and is *not* a topology designed with reliability in mind!
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
