Re: [ceph-users] Persistent Write Back Cache
Hi Nick, Christian,

This is something we've discussed a bit, but it hasn't made it to the top of the list.

I think having a single persistent copy on the client has *some* value, although it's limited because it's a single point of failure. The simplest scenario would be to use it as a write-through cache that accelerates reads only.

Another option would be to have a shared but local device (like an SSD that is connected to a pair of client hosts, or has fast access within a rack--a scenario that I've heard a few vendors talk about). It still leaves a host pair or rack as a failure zone, but there are times where that's appropriate.

In either case, though, I think the key RBD feature that would make it much more valuable would be if RBD (librbd presumably) could maintain the writeback cache with some sort of checkpoints or journal internally, such that writes that get flushed back to the cluster are always *crash consistent*. So even if you lose the client cache entirely, your disk image is still holding a valid file system that looks like it is just a little bit stale. If the client-side writeback cache were structured as a data journal this would be pretty straightforward... it might even mesh well with the RBD mirroring?

sage

On Wed, 4 Mar 2015, Nick Fisk wrote:

Hi Christian,

Yes, that's correct, it's on the client side. I don't see this as much different to a battery-backed RAID controller: if you lose power, the data stays in the cache until power resumes, at which point it is flushed.

If you are going to have the same RBD accessed by multiple servers/clients then you need to make sure the SSD is accessible to both (e.g. DRBD / dual-port SAS). But then something like Pacemaker would be responsible for ensuring the RBD and cache device are both present before allowing client access.

When I wrote this I was thinking more about 2 HA iSCSI servers with RBDs; however, I can understand that this feature would prove more of a challenge if you are using Qemu and RBD.

Nick

-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Christian Balzer
Sent: 04 March 2015 08:40
To: ceph-users@lists.ceph.com
Cc: Nick Fisk
Subject: Re: [ceph-users] Persistent Write Back Cache

Hello,

If I understand you correctly, you're talking about the rbd cache on the client side.

So assume that host, or the cache SSD on it, fails terminally. The client thinks its syncs are on the permanent storage (the actual Ceph storage cluster), while they are only present locally. So restarting that service or VM on a different host now has to deal with likely crippling data corruption.

Regards,

Christian

On Wed, 4 Mar 2015 08:26:52 - Nick Fisk wrote:

Hi All,

Is there anything in the pipeline to add the ability to write the librbd cache to SSD so that it can safely ignore sync requests? I have seen a thread from a few years back where Sage was discussing something similar, but I can't find anything more recent discussing it.

I've been running lots of tests on our new cluster; buffered/parallel performance is amazing (40K read, 10K write IOPS), very impressed. However, sync writes are actually quite disappointing.

Running fio with 128k block size and depth=1 normally only gives me about 300 IOPS or 30MB/s. I'm seeing 2-3ms latency writing to SSD OSDs, and from what I hear that's about normal, so I don't think I have a Ceph config problem. For applications which do a lot of syncs, like ESXi over iSCSI or SQL databases, this has a major performance impact.

Traditional storage arrays work around this problem by having a battery-backed cache which has latency 10-100 times lower than what you can currently achieve with Ceph and an SSD. Whilst librbd does have a writeback cache, from what I understand it will not cache syncs, and so in my use case it effectively acts like a write-through cache.

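As a rough sanity check on the figures above: at queue depth 1 each write has to complete before the next one is issued, so IOPS is bounded by 1/latency. The Python sketch below works through that arithmetic for the 2-3ms latencies quoted. The function names are made up for illustration, and "128k" is assumed to mean 128 KiB.

```python
# Back-of-the-envelope model of sync writes at queue depth 1.
# Assumption: exactly one write is in flight at a time, so IOPS = 1 / latency.

KIB = 1024
MIB = 1024 * KIB


def qd1_iops(latency_s: float) -> float:
    """Upper bound on IOPS when only one I/O is outstanding."""
    return 1.0 / latency_s


def throughput_mib_s(iops: float, block_bytes: int) -> float:
    """Throughput in MiB/s for a given IOPS figure and block size."""
    return iops * block_bytes / MIB


if __name__ == "__main__":
    block = 128 * KIB  # assumed meaning of "128k"
    for latency_ms in (2.0, 3.0):
        iops = qd1_iops(latency_ms / 1000.0)
        print(f"{latency_ms} ms -> {iops:.0f} IOPS, "
              f"{throughput_mib_s(iops, block):.1f} MiB/s")
```

Read the other way round, the reported figure of about 300 IOPS corresponds to roughly 3.3ms per write, the same order as the 2-3ms quoted, which suggests the test is latency-bound rather than bandwidth-bound.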
To illustrate the difference a proper write-back cache can make, I put a 1GB (512MB dirty threshold) flashcache in front of my RBD and tweaked the flush parameters to flush dirty blocks at a large queue depth. The same fio test (128k, iodepth=1) now runs at 120MB/s and is limited by the performance of the SSD used by flashcache, as everything is stored as 4k blocks on the SSD. In fact, since everything is stored as 4k blocks, pretty much all IO sizes are accelerated to the max speed of the SSD.

Looking at iostat I can see all the IOs are getting coalesced into nice large 512KB IOs at a high queue depth, which Ceph easily swallows.

If librbd could support writing its cache out to SSD it would hopefully achieve the same level of performance, and having it integrated would be really neat.

Nick

--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine

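Sage's suggestion at the top of this message, structuring the client-side writeback cache as a data journal so that whatever reaches the cluster is always crash consistent, can be illustrated with a toy model. The Python below is only a sketch under stated assumptions and not how librbd works: `JournaledWriteBackCache`, `backend`, and the dict standing in for a disk image are all hypothetical. Writes are acknowledged once appended to a local journal, and flushing replays entries strictly in order, so the backend always holds a prefix of the acknowledged write history.

```python
from dataclasses import dataclass, field


@dataclass
class JournaledWriteBackCache:
    """Toy ordered write-back cache (hypothetical, for illustration only).

    `backend` stands in for the RBD image in the cluster; `journal`
    stands in for an append-only log on a local SSD.
    """
    backend: dict = field(default_factory=dict)
    journal: list = field(default_factory=list)

    def write(self, offset: int, data: bytes) -> None:
        # Acknowledge as soon as the write is appended to the local journal.
        self.journal.append((offset, data))

    def read(self, offset: int):
        # The newest journaled write wins; otherwise fall back to the backend.
        for off, data in reversed(self.journal):
            if off == offset:
                return data
        return self.backend.get(offset)

    def flush(self, max_entries=None) -> int:
        # Replay the oldest entries first and never reorder them, so the
        # backend is always some prefix of the acknowledged history.
        n = len(self.journal)
        if max_entries is not None:
            n = min(n, max_entries)
        for off, data in self.journal[:n]:
            self.backend[off] = data
        del self.journal[:n]
        return n


if __name__ == "__main__":
    cache = JournaledWriteBackCache()
    cache.write(0, b"A1")
    cache.write(4096, b"B1")
    cache.write(0, b"A2")
    cache.flush(max_entries=2)
    # Simulate losing the client cache entirely: only the backend survives,
    # and it reflects the state after the first two writes.
    print(cache.backend)
```

By contrast, a block cache that flushes dirty blocks in arbitrary order (for example sorted by offset) can, if lost mid-flush, leave the image in a state that never existed from the writer's point of view.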
Re: [ceph-users] Persistent Write Back Cache
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of John Spray
Sent: 04 March 2015 11:34
To: Nick Fisk; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Persistent Write Back Cache

On 04/03/2015 08:26, Nick Fisk wrote:
> To illustrate the difference a proper write-back cache can make, I put a 1GB (512MB dirty threshold) flashcache in front of my RBD and tweaked the flush parameters to flush dirty blocks at a large queue depth. The same fio test (128k, iodepth=1) now runs at 120MB/s and is limited by the performance of the SSD used by flashcache, as everything is stored as 4k blocks on the SSD. In fact, since everything is stored as 4k blocks, pretty much all IO sizes are accelerated to the max speed of the SSD. Looking at iostat I can see all the IOs are getting coalesced into nice large 512KB IOs at a high queue depth, which Ceph easily swallows. If librbd could support writing its cache out to SSD it would hopefully achieve the same level of performance, and having it integrated would be really neat.

What are you hoping to gain from building something into Ceph instead of using flashcache/bcache/dm-cache on top of it? It seems like, since you would need to handle your HA configuration anyway, setting up the actual cache device would be the simple part.

Cheers,
John

Hi John,

I guess it's to make things easier, rather than having to run a huge stack of different technologies to achieve the same goal, especially when half of the caching logic is already in Ceph. It would be really nice, and would drive adoption, if you could add an SSD, set a config option and suddenly have a storage platform that performs 10x faster.

Another way of handling it might be for librbd to be pointed at a UUID instead of a /dev/sd* device. That way librbd knows what cache device to look for and will error out if the cache device is missing. These cache devices could then be presented to all necessary servers via iSCSI or something similar if the RBD will need to move around.

Nick

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

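Nick's suggestion above, identifying the cache by UUID rather than by a /dev/sd* name and refusing to proceed if the device is absent, might look something like the following on a Linux host. This is a hypothetical sketch and not a librbd feature: `resolve_cache_device` and `CacheDeviceMissing` are invented names, and it assumes udev's `/dev/disk/by-uuid/` symlinks are available. That directory holds filesystem UUIDs, so a raw cache partition would more likely be looked up under `/dev/disk/by-partuuid/` or `/dev/disk/by-id/`, which is why the directory is a parameter.

```python
import os


class CacheDeviceMissing(RuntimeError):
    """Raised when the configured cache device cannot be found (hypothetical)."""


def resolve_cache_device(uuid: str, by_uuid_dir: str = "/dev/disk/by-uuid") -> str:
    """Return the real device path for `uuid`, or raise if it is absent.

    Assumes udev-style symlinks: <by_uuid_dir>/<uuid> -> ../../sdX
    """
    link = os.path.join(by_uuid_dir, uuid)
    # os.path.exists() follows symlinks, so a dangling link counts as missing.
    if not os.path.exists(link):
        raise CacheDeviceMissing(
            f"cache device {uuid} not present; refusing to open image")
    return os.path.realpath(link)


if __name__ == "__main__":
    try:
        print(resolve_cache_device("00000000-0000-0000-0000-000000000000"))
    except CacheDeviceMissing as exc:
        print(f"error: {exc}")
```

Failing hard here is the point of the design: opening the image without its cache would silently expose a stale copy, which is the corruption scenario raised earlier in the thread.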
Re: [ceph-users] Persistent Write Back Cache
On 03/04/2015 05:34 AM, John Spray wrote:
> On 04/03/2015 08:26, Nick Fisk wrote:
>> To illustrate the difference a proper write-back cache can make, I put a 1GB (512MB dirty threshold) flashcache in front of my RBD and tweaked the flush parameters to flush dirty blocks at a large queue depth. The same fio test (128k, iodepth=1) now runs at 120MB/s and is limited by the performance of the SSD used by flashcache, as everything is stored as 4k blocks on the SSD. In fact, since everything is stored as 4k blocks, pretty much all IO sizes are accelerated to the max speed of the SSD. Looking at iostat I can see all the IOs are getting coalesced into nice large 512KB IOs at a high queue depth, which Ceph easily swallows. If librbd could support writing its cache out to SSD it would hopefully achieve the same level of performance, and having it integrated would be really neat.
>
> What are you hoping to gain from building something into Ceph instead of using flashcache/bcache/dm-cache on top of it? It seems like, since you would need to handle your HA configuration anyway, setting up the actual cache device would be the simple part.

Agreed regarding flashcache/bcache/dm-cache. I suspect improving an existing project rather than reinventing it ourselves would be the way to go. It may also be worth looking at Luis's work, though I note that he specifically says write-through:

http://vault2015.sched.org/event/6cc56a5b8a95ead46961697028b59c39#.VPc0uX-etWQ
https://github.com/pblcache/pblcache

> Cheers,
> John

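The write-through caveat matters for this thread because a write-through cache only acknowledges a write once it has reached the backing store, so it can speed up reads but does nothing for sync write latency. A minimal Python illustration, with hypothetical class and attribute names and a dict standing in for the backing store (this is not pblcache's code):

```python
class WriteThroughCache:
    """Toy write-through cache (hypothetical names, illustration only)."""

    def __init__(self, backend: dict):
        self.backend = backend      # stands in for the slow backing store
        self.cache = {}             # stands in for the local SSD
        self.backend_writes = 0
        self.backend_reads = 0

    def write(self, key, value) -> None:
        # The write is not complete until the backend has it, so write
        # latency is unchanged; the cached copy only helps later reads.
        self.backend[key] = value
        self.backend_writes += 1
        self.cache[key] = value

    def read(self, key):
        if key in self.cache:
            return self.cache[key]  # cache hit: no backend round trip
        self.backend_reads += 1
        value = self.backend[key]
        self.cache[key] = value
        return value


if __name__ == "__main__":
    c = WriteThroughCache({"x": 1})
    c.write("a", 2)
    c.read("a")
    c.read("x")
    c.read("x")
    print(c.backend_writes, c.backend_reads)
```

Every write still costs one backend round trip, while repeated reads of the same key cost at most one.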
Re: [ceph-users] Persistent Write Back Cache
Hello Nick,

On Wed, 4 Mar 2015 08:49:22 - Nick Fisk wrote:
> Hi Christian,
>
> Yes, that's correct, it's on the client side. I don't see this as much different to a battery-backed RAID controller: if you lose power, the data stays in the cache until power resumes, at which point it is flushed.
>
> If you are going to have the same RBD accessed by multiple servers/clients then you need to make sure the SSD is accessible to both (e.g. DRBD / dual-port SAS). But then something like Pacemaker would be responsible for ensuring the RBD and cache device are both present before allowing client access.

Which is pretty much any and all use cases I can think about. Because it's not only concurrent (active/active) accesses, but you really need to have things consistent across all possible client hosts in case of a node failure.

I'm no stranger to DRBD and Pacemaker (which incidentally didn't make it into Debian Jessie, cue massive laughter and ridicule), btw.

> When I wrote this I was thinking more about 2 HA iSCSI servers with RBDs; however, I can understand that this feature would prove more of a challenge if you are using Qemu and RBD.

One of the reasons I'm using Ceph/RBD instead of DRBD (which is vastly more suited for some use cases) is that it allows me n+1 instead of n+n redundancy when it comes to consumers (compute nodes in my case).

Now for your iSCSI head (looking forward to your results and any config recipes) that limitation to a pair may be just as well, but as others wrote it might be best to go forward with this outside of Ceph. Especially since you're already dealing with an HA cluster/Pacemaker in that scenario.

Christian

--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/

Re: [ceph-users] Persistent Write Back Cache
Hi Christian,

Yes, that's correct, it's on the client side. I don't see this as much different to a battery-backed RAID controller: if you lose power, the data stays in the cache until power resumes, at which point it is flushed.

If you are going to have the same RBD accessed by multiple servers/clients then you need to make sure the SSD is accessible to both (e.g. DRBD / dual-port SAS). But then something like Pacemaker would be responsible for ensuring the RBD and cache device are both present before allowing client access.

When I wrote this I was thinking more about 2 HA iSCSI servers with RBDs; however, I can understand that this feature would prove more of a challenge if you are using Qemu and RBD.

Nick

Re: [ceph-users] Persistent Write Back Cache
On 04/03/2015 08:26, Nick Fisk wrote:
> To illustrate the difference a proper write-back cache can make, I put a 1GB (512MB dirty threshold) flashcache in front of my RBD and tweaked the flush parameters to flush dirty blocks at a large queue depth. The same fio test (128k, iodepth=1) now runs at 120MB/s and is limited by the performance of the SSD used by flashcache, as everything is stored as 4k blocks on the SSD. In fact, since everything is stored as 4k blocks, pretty much all IO sizes are accelerated to the max speed of the SSD. Looking at iostat I can see all the IOs are getting coalesced into nice large 512KB IOs at a high queue depth, which Ceph easily swallows. If librbd could support writing its cache out to SSD it would hopefully achieve the same level of performance, and having it integrated would be really neat.

What are you hoping to gain from building something into Ceph instead of using flashcache/bcache/dm-cache on top of it? It seems like, since you would need to handle your HA configuration anyway, setting up the actual cache device would be the simple part.

Cheers,
John
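The coalescing Nick reports seeing in iostat (many cached 4k blocks being written back as large 512KB IOs) can be sketched as merging contiguous dirty block numbers into extents capped at 512 KiB. This is a simplified illustration with hypothetical names, not the actual algorithm used by flashcache or the kernel block layer.

```python
BLOCK = 4 * 1024          # cache block size mentioned in the thread (4k)
MAX_IO = 512 * 1024       # largest write-back IO to issue (512k)


def coalesce(dirty_blocks, block=BLOCK, max_io=MAX_IO):
    """Merge contiguous dirty block numbers into (offset, length) extents.

    Each extent covers consecutive blocks and is at most `max_io` bytes.
    """
    extents = []
    start = prev = None
    for b in sorted(set(dirty_blocks)):
        contiguous = prev is not None and b == prev + 1
        full = start is not None and (b - start) * block >= max_io
        if contiguous and not full:
            prev = b          # extend the current extent
            continue
        if start is not None:
            # Close the current extent before starting a new one.
            extents.append((start * block, (prev - start + 1) * block))
        start = prev = b
    if start is not None:
        extents.append((start * block, (prev - start + 1) * block))
    return extents


if __name__ == "__main__":
    # Blocks 0-2 and 10-11 are dirty: two extents instead of five 4k writes.
    print(coalesce([0, 1, 2, 10, 11]))
```

With 256 contiguous dirty blocks (1 MiB), this yields two 512 KiB extents rather than 256 separate 4k writes, which is the kind of pattern a backend with high per-operation latency handles far better.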