Re: [ceph-users] Persistent Write Back Cache

2015-03-04 Thread Sage Weil
Hi Nick, Christian,

This is something we've discussed a bit but hasn't made it to the top of 
the list.

I think having a single persistent copy on the client has *some* value,
although it's limited because it's a single point of failure.  The simplest
scenario would be to use it as a write-through cache that accelerates
reads only.

Another option would be to have a shared but local device (like an SSD 
that is connected to a pair of client hosts, or has fast access within a 
rack--a scenario that I've heard a few vendors talk about).  It 
still leaves a host pair or rack as a failure zone, but there are 
times where that's appropriate.

In either case, though, I think the key RBD feature that would make it 
much more valuable would be if RBD (librbd presumably) could maintain the 
writeback cache with some sort of checkpoints or journal internally such 
that writes that get flushed back to the cluster are always *crash 
consistent*.  So even if you lose the client cache entirely, your disk 
image is still holding a valid file system that looks like it is just a 
little bit stale.

If the client-side writeback cache were structured as a data journal this
would be pretty straightforward...  it might even mesh well with the RBD
mirroring?
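
To make the journal idea concrete, here is a minimal Python sketch, not
librbd code. It assumes a hypothetical JournaledWriteBackCache class in
which a dict stands in for the RBD image and a list stands in for the
journal on the local SSD. Writes are acknowledged once they are appended
to the journal, and flushing replays entries strictly oldest-first, so the
image always reflects some prefix of the write history. Losing the cache
then leaves an image that is stale but consistent, rather than one with
newer writes applied ahead of older ones.

```python
from dataclasses import dataclass, field


@dataclass
class JournaledWriteBackCache:
    """Toy model only: names and structure are illustrative assumptions."""

    image: dict = field(default_factory=dict)    # stands in for the RBD image
    journal: list = field(default_factory=list)  # stands in for the SSD journal

    def write(self, block: int, data: bytes) -> None:
        # Acknowledge as soon as the write is recorded, in order, in the journal.
        self.journal.append((block, data))

    def read(self, block: int):
        # The newest journaled copy wins; otherwise fall back to the image.
        for journaled_block, data in reversed(self.journal):
            if journaled_block == block:
                return data
        return self.image.get(block)

    def flush(self, max_entries=None) -> int:
        # Replay the oldest entries first so the image never skips a write.
        pending = len(self.journal)
        count = pending if max_entries is None else min(max_entries, pending)
        for block, data in self.journal[:count]:
            self.image[block] = data
        del self.journal[:count]
        return count


cache = JournaledWriteBackCache()
cache.write(0, b"superblock v1")
cache.write(1, b"inode table")
cache.write(0, b"superblock v2")
cache.flush(max_entries=2)            # only the two oldest writes reach the cluster
surviving_image = dict(cache.image)   # what remains if the client cache is lost now
print(surviving_image)
```

A real implementation would also have to preserve ordering across writes
that are in flight to the cluster, which this sketch ignores.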

sage



On Wed, 4 Mar 2015, Nick Fisk wrote:

 Hi Christian,
 
 Yes, that's correct, it's on the client side. I don't see this as much different
 from a battery-backed RAID controller: if you lose power, the data stays in the
 cache until power resumes, at which point it is flushed.
 
 If you are going to have the same RBD accessed by multiple servers/clients
 then you need to make sure the SSD is accessible to both (eg DRBD / Dual
 Port SAS). But then something like pacemaker would be responsible for
 ensuring the RBD and cache device are both present before allowing client
 access.
 
 When I wrote this I was thinking more about 2 HA iSCSI servers with RBD's,
 however I can understand that this feature would prove more of a challenge
 if you are using Qemu and RBD.
 
 Nick
 
 -----Original Message-----
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Christian Balzer
 Sent: 04 March 2015 08:40
 To: ceph-users@lists.ceph.com
 Cc: Nick Fisk
 Subject: Re: [ceph-users] Persistent Write Back Cache
 
 
 Hello,
 
 If I understand you correctly, you're talking about the rbd cache on the
 client side.
 
 So assume that host, or the cache SSD on it, fails terminally.
 The client thinks its synced writes are on the permanent storage (the actual Ceph
 storage cluster), while they are only present locally.
 
 So restarting that service or VM on a different host now has to deal with
 likely crippling data corruption.
 
 Regards,
 
 Christian
 
 On Wed, 4 Mar 2015 08:26:52 - Nick Fisk wrote:
 
  Hi All,
  
   
  
  Is there anything in the pipeline to add the ability to write the 
  librbd cache to ssd so that it can safely ignore sync requests? I have 
  seen a thread a few years back where Sage was discussing something 
  similar, but I can't find anything more recent discussing it.
  
   
  
  I've been running lots of tests on our new cluster. Buffered/parallel
  performance is amazing (40K read, 10K write IOPS) and I'm very impressed.
  However, sync writes are actually quite disappointing.
  
   
  
  Running fio with a 128k block size and iodepth=1 normally gives me only
  about 300 IOPS or 30MB/s. I'm seeing 2-3ms latency writing to SSD OSDs
  and, from what I hear, that's about normal, so I don't think I have a
  Ceph config problem. For applications which do a lot of syncs, like
  ESXi over iSCSI or SQL databases, this has a major performance impact.
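
As a rough sanity check on those numbers: with only one I/O outstanding,
each write has to wait for the previous one to complete, so IOPS is roughly
the inverse of per-write latency. The Python sketch below works that
through for a 128 KiB block size. The function names are made up for
illustration and the figures are approximations, not measurements.

```python
BLOCK_SIZE = 128 * 1024  # bytes per write, matching the 128k fio block size


def iops_at_depth_one(latency_s: float) -> float:
    """With a queue depth of 1, writes are strictly serialized."""
    return 1.0 / latency_s


def throughput_mb_s(iops: float, block_size: int = BLOCK_SIZE) -> float:
    """Convert an IOPS figure to decimal megabytes per second."""
    return iops * block_size / 1e6


for latency_ms in (2.0, 3.0):
    iops = iops_at_depth_one(latency_ms / 1000.0)
    print(f"{latency_ms:.0f} ms per write: {iops:.0f} IOPS, "
          f"{throughput_mb_s(iops):.1f} MB/s")
```

At 2-3ms per write this gives roughly 333-500 IOPS, so a result of about
300 IOPS is in the same ballpark. Note that 300 IOPS at 128 KiB works out
nearer 39MB/s than 30MB/s, so the quoted figures look rounded.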
  
   
  
  Traditional storage arrays work around this problem by having a
  battery-backed cache with latency 10-100 times lower than what you
  can currently achieve with Ceph and an SSD. Whilst librbd does have a
  writeback cache, from what I understand it will not cache syncs, and so
  in my use case it effectively acts like a write-through cache.
  
   
  
  To illustrate the difference a proper write-back cache can make, I put
  a 1GB (512MB dirty threshold) flashcache in front of my RBD and
  tweaked the flush parameters to flush dirty blocks at a large queue
  depth. The same fio test (128k, iodepth=1) now runs at 120MB/s and is
  limited by the performance of the SSD used by flashcache, as everything is
  stored as 4k blocks on the SSD. In fact, since everything is stored as
  4k blocks, pretty much all IO sizes are accelerated to the max speed of the
  SSD. Looking at iostat I can see all the IOs are getting coalesced into
  nice large 512KB IOs at a high queue depth, which Ceph easily swallows.
  
   
  
  If librbd could support writing its cache out to SSD it would 
  hopefully achieve the same level of performance and having it 
  integrated would be really neat.
  
   
  
  Nick
  
  
  
  
 
 
 -- 
 Christian Balzer        Network/Systems Engineer
 ch...@gol.com Global OnLine

Re: [ceph-users] Persistent Write Back Cache

2015-03-04 Thread Nick Fisk

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
John Spray
Sent: 04 March 2015 11:34
To: Nick Fisk; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Persistent Write Back Cache

 

 

On 04/03/2015 08:26, Nick Fisk wrote:

To illustrate the difference a proper write-back cache can make, I put a 1GB
(512MB dirty threshold) flashcache in front of my RBD and tweaked the flush
parameters to flush dirty blocks at a large queue depth. The same fio test
(128k, iodepth=1) now runs at 120MB/s and is limited by the performance of the
SSD used by flashcache, as everything is stored as 4k blocks on the SSD. In
fact, since everything is stored as 4k blocks, pretty much all IO sizes are
accelerated to the max speed of the SSD. Looking at iostat I can see all the
IOs are getting coalesced into nice large 512KB IOs at a high queue depth,
which Ceph easily swallows.
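
The coalescing effect can be pictured with a small Python sketch. This is
not how flashcache or the block layer is implemented; it only assumes that
dirty data is tracked as 4 KiB block numbers and that contiguous runs are
merged into writes of at most 512 KiB. The function name and defaults are
hypothetical.

```python
BLOCK_SIZE = 4 * 1024      # cache block size in bytes
MAX_IO_SIZE = 512 * 1024   # largest write issued to the backing device


def coalesce(dirty_blocks, block_size=BLOCK_SIZE, max_io=MAX_IO_SIZE):
    """Merge contiguous dirty block numbers into (offset, length) extents."""
    max_blocks = max_io // block_size
    extents = []
    run_start = None
    run_len = 0
    for block in sorted(set(dirty_blocks)):
        contiguous = run_start is not None and block == run_start + run_len
        if contiguous and run_len < max_blocks:
            run_len += 1  # extend the current run
        else:
            if run_start is not None:
                extents.append((run_start * block_size, run_len * block_size))
            run_start, run_len = block, 1  # start a new run
    if run_start is not None:
        extents.append((run_start * block_size, run_len * block_size))
    return extents


# 256 contiguous dirty 4 KiB blocks (1 MiB) become two 512 KiB writes.
print(coalesce(range(256)))
```

Many small synchronous writes absorbed by the cache can therefore reach the
cluster as a few large ones, which is the pattern described in iostat.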

 

If librbd could support writing its cache out to SSD it would hopefully
achieve the same level of performance and having it integrated would be
really neat. 

What are you hoping to gain from building something into ceph instead of
using flashcache/bcache/dm-cache on top of it?  It seems like since you
would anyway need to handle your HA configuration, setting up the actual
cache device would be the simple part.

Cheers,
John

 

Hi John,

 

I guess it's to make things easier rather than having to run a huge stack of
different technologies to achieve the same goal, especially when half of the
caching logic is already in Ceph. It would be really nice, and would drive
adoption, if you could add an SSD, set a config option and suddenly have a
storage platform that performs 10x faster.

 

Another way of handling it might be for librbd to be pointed at a uuid
instead of a /dev/sd* device. That way librbd knows what cache device to
look for and will error out if the cache device is missing. These cache
devices could then be presented to all necessary servers via iSCSI or
something similar if the RBD will need to move around.
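
A minimal sketch of that lookup in Python, assuming a Linux host where udev
maintains /dev/disk/by-uuid symlinks. The function name is hypothetical and
this is not an existing librbd option. Note too that by-uuid entries are
filesystem UUIDs, so a raw cache device might need a different identifier
such as a partition UUID.

```python
import os


def resolve_cache_device(uuid: str, by_uuid_dir: str = "/dev/disk/by-uuid") -> str:
    """Return the device path for a UUID, or fail loudly if it is absent."""
    link = os.path.join(by_uuid_dir, uuid)
    if not os.path.exists(link):
        # Refuse to continue without the cache: it may hold unflushed writes.
        raise FileNotFoundError(f"cache device with UUID {uuid} is not present")
    # Follow the symlink to the underlying device node.
    return os.path.realpath(link)
```

Erroring out when the device is missing matters because opening the image
without its cache could expose it without writes that were acknowledged
but never flushed.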

 

Nick




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Persistent Write Back Cache

2015-03-04 Thread Mark Nelson

On 03/04/2015 05:34 AM, John Spray wrote:



What are you hoping to gain from building something into ceph instead of
using flashcache/bcache/dm-cache on top of it?  It seems like since you
would anyway need to handle your HA configuration, setting up the actual
cache device would be the simple part.


Agreed regarding flashcache/bcache/dm-cache.  I suspect improving an 
existing project rather than reinventing it ourselves would be the way 
to go.  It may also be worth looking at Luis's work, though I note that 
he specifically says write-through:


http://vault2015.sched.org/event/6cc56a5b8a95ead46961697028b59c39#.VPc0uX-etWQ

https://github.com/pblcache/pblcache







Re: [ceph-users] Persistent Write Back Cache

2015-03-04 Thread Christian Balzer

Hello Nick,

On Wed, 4 Mar 2015 08:49:22 - Nick Fisk wrote:

 Hi Christian,
 
 Yes, that's correct, it's on the client side. I don't see this as much
 different from a battery-backed RAID controller: if you lose power, the
 data stays in the cache until power resumes, at which point it is flushed.
 
 If you are going to have the same RBD accessed by multiple
 servers/clients then you need to make sure the SSD is accessible to both
 (eg DRBD / Dual Port SAS). But then something like pacemaker would be
 responsible for ensuring the RBD and cache device are both present
 before allowing client access.
 
Which is pretty much any and all use cases I can think of.
It's not only about concurrent (active/active) access; you
really need to have things consistent across all possible client hosts in
case of a node failure.

I'm no stranger to DRBD and Pacemaker (which incidentally didn't make it
into Debian Jessie, cue massive laughter and ridicule), btw.

 When I wrote this I was thinking more about 2 HA iSCSI servers with
 RBD's, however I can understand that this feature would prove more of a
 challenge if you are using Qemu and RBD.
 
One of the reasons I'm using Ceph/RBD instead of DRBD (which is vastly
more suited for some use cases) is that it allows me n+1 instead of n+n
redundancy when it comes to consumers (compute nodes in my case). 

Now for your iSCSI head (looking forward to your results and any config
recipes) that limitation to a pair may be just as well, but as others
wrote it might be best to go forward with this outside of Ceph,
especially since you're already dealing with an HA cluster/Pacemaker in
that scenario.


Christian



-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/


Re: [ceph-users] Persistent Write Back Cache

2015-03-04 Thread Nick Fisk
Hi Christian,

Yes, that's correct, it's on the client side. I don't see this as much different
from a battery-backed RAID controller: if you lose power, the data stays in the
cache until power resumes, at which point it is flushed.

If you are going to have the same RBD accessed by multiple servers/clients
then you need to make sure the SSD is accessible to both (eg DRBD / Dual
Port SAS). But then something like pacemaker would be responsible for
ensuring the RBD and cache device are both present before allowing client
access.

When I wrote this I was thinking more about 2 HA iSCSI servers with RBD's,
however I can understand that this feature would prove more of a challenge
if you are using Qemu and RBD.

Nick



Re: [ceph-users] Persistent Write Back Cache

2015-03-04 Thread John Spray



On 04/03/2015 08:26, Nick Fisk wrote:
To illustrate the difference a proper write-back cache can make, I put
a 1GB (512MB dirty threshold) flashcache in front of my RBD and
tweaked the flush parameters to flush dirty blocks at a large queue
depth. The same fio test (128k, iodepth=1) now runs at 120MB/s and is
limited by the performance of the SSD used by flashcache, as everything is
stored as 4k blocks on the SSD. In fact, since everything is stored as
4k blocks, pretty much all IO sizes are accelerated to the max speed of
the SSD. Looking at iostat I can see all the IOs are getting
coalesced into nice large 512KB IOs at a high queue depth, which Ceph
easily swallows.


If librbd could support writing its cache out to SSD it would 
hopefully achieve the same level of performance and having it 
integrated would be really neat.


What are you hoping to gain from building something into ceph instead of 
using flashcache/bcache/dm-cache on top of it?  It seems like since you 
would anyway need to handle your HA configuration, setting up the actual 
cache device would be the simple part.


Cheers,
John