Re: [ceph-users] Synchronous writes - tuning and some thoughts about them?

Nick Fisk Wed, 27 May 2015 03:30:16 -0700

Hi Jan,

Responses inline below


> -----Original Message-----
> From: ceph-users [mailto:[email protected]] On Behalf Of
> Jan Schermer
> Sent: 25 May 2015 21:14
> To: Nick Fisk
> Cc: [email protected]
> Subject: Re: [ceph-users] Synchronous writes - tuning and some thoughts
> about them?
> 
> Hi Nick,
> 
> flashcache doesn’t support barriers, so I haven’t even considered it. I used
> it a few years ago to speed up some workloads out of curiosity and it worked
> well, but I can’t use it to cache this kind of workload.
> 
> EnhanceIO passed my initial testing (although the documentation is very
> sketchy and the project abandoned AFAIK), and is supposed to respect
> barriers/flushes. I was only interested in a “volatile cache” scenario -
> create a ramdisk in the guest (for example 1GB) and use it to cache the
> virtual block device (and of course flush and remove it before rebooting).
> All worked pretty well during my testing with fio & stuff until I ran the
> actual workload - in my case a DB2 9.7 database. It took just minutes for
> the kernel to panic (I can share a screenshot if you’d like). So it was not
> a host failure but a guest failure and it managed to fail on two fronts -
> stability and crash consistency - at the same time. The filesystem was
> completely broken afterwards - while it could be mounted “cleanly” (journal
> appeared consistent), there was massive damage to the files. I expected the
> open files to be zeroed or missing or damaged, but it did very random
> damage all over the place including binaries in /bin, manpage files and so
> on - things that nobody was even touching. Scary.

I see, so just to confirm: you don't want to use a caching solution with an SSD, 
just a RAM disk? I think that’s where our approaches differed, and I can 
understand why you are probably having problems when the OS crashes or suffers 
a power loss. I was going the SSD route, with something like:

http://www.storagereview.com/hgst_ultrastar_ssd800mm_enterprise_ssd_review

on my iSCSI head nodes, but if you are exporting RBDs to lots of different 
servers I guess this wouldn't work quite the same.

I don't really see a solution that could work for you without using SSDs for 
the cache. You seem to be suffering from slow sync writes and want to cache 
them in a volatile medium, but by their very nature sync writes are meant to be 
safe once the write confirmation arrives. I guess in any caching solution 
barriers go some way to guard against data corruption, but if properly 
implemented they will probably also slow things down to what you can achieve 
with just RBD caching. It is much like hardware RAID controllers, which only 
enable writeback cache if they can guarantee data safety, either by a 
functioning battery backup or a flash device.
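
To put a number on the sync write problem, here is a minimal Python sketch. It 
is only an illustration under my own assumptions: the function names and the 
test file are made up for this example, and it measures whatever filesystem 
the temporary directory lives on, not RBD specifically. It times a series of 
writes that are each followed by fsync(), then derives the ceiling for a client 
that issues one sync write at a time (at 5ms per write that ceiling is 200 per 
second, as in Jan's original mail).

```python
import os
import tempfile
import time


def max_sync_ops_per_second(latency_ms: float) -> float:
    """Upper bound on serialized sync operations per second for one client."""
    return 1000.0 / latency_ms


def measure_sync_write_latency_ms(path: str, count: int = 50,
                                  block_size: int = 4096) -> float:
    """Average latency in ms of `count` writes, each followed by fsync()."""
    block = b"\0" * block_size
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        start = time.perf_counter()
        for _ in range(count):
            os.write(fd, block)
            os.fsync(fd)  # wait until the OS reports the write as durable
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return elapsed * 1000.0 / count


if __name__ == "__main__":
    # Hypothetical test location; point this at the device you care about.
    with tempfile.TemporaryDirectory() as tmp:
        latency = measure_sync_write_latency_ms(os.path.join(tmp, "sync-test.bin"))
    print(f"avg sync write latency: {latency:.3f} ms")
    print(f"single-client ceiling: {max_sync_ops_per_second(latency):.0f} ops/s")
```

Running this inside a guest on an RBD-backed filesystem versus a local SSD 
should show roughly the gap being discussed, although fio with a sync engine 
is the better tool for real measurements.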


> 
> I don’t really understand your question about flashcache - do you run it in
> writeback mode? It’s been years since I used it so I won’t be much help here
> - I disregarded it as unsafe right away because of barriers and wouldn’t use
> it in production.
> 

What I mean is that for every IO that passes through flashcache, I see it 
written to the SSD with no delay/buffering. So in a kernel panic or power loss 
situation, as long as the SSD has power-loss capacitors and the flashcache 
device is assembled correctly before mounting, I don't see a way for data to be 
lost. I haven't done a lot of testing around this yet though, so I could be 
missing something.

> I don’t think a persistent cache is something to do right now, it would be
> overly complex to implement, it would limit migration, and it can be done on
> the guest side with (for example) bcache if really needed - you can always
> expose a local LVM volume to the guest and use it for caching (and that’s
> something I might end up doing) with mostly the same effect.
> For most people (and that’s my educated guess) the only needed features
> are that it needs to be fast(-er) and it needs to come up again after a crash
> without recovering from backup - that’s something that could be just a slight
> modification to the existing RBD cache - just don’t flush it on every fsync()
> but maintain ordering - and it’s done? I imagine some ordering is there
> already, it must be flushed when the guest is migrated, and it’s production-
> grade and not just some hackish attempt. It just doesn’t really cache the
> stuff that matters most in my scenario…

My initial idea was just to be able to specify a block device to use for 
writeback caching in librbd. This could either be a local block device 
(dual-port SAS for failover/cluster) or an iSCSI device if it needs to be 
shared around a larger cluster of hypervisors, etc.

Ideally though this would all be managed through Ceph with some sort of 
"OSD-lite" device which is optimized for sync writes but misses out on a lot of 
the distributed functionality of a full-fat OSD. This way you could create a 
writeback pool and then just specify it in the librbd config.
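
To illustrate the property such a cache would need (acknowledge early, but 
never reorder), here is a toy Python model. Everything in it is hypothetical: 
the class and method names are invented for this sketch and none of it is 
librbd or Ceph code. Writes are acknowledged as soon as they are queued and 
are later destaged to the backing store strictly oldest first, so after losing 
a volatile cache the backing store holds a prefix of the write history: recent 
writes are gone, but nothing is out of order.

```python
from collections import deque


class OrderedWritebackCache:
    """Toy model: acknowledge writes at once, destage them strictly in order."""

    def __init__(self) -> None:
        self.backing = {}        # offset -> data; stands in for the RBD image
        self.pending = deque()   # queued (offset, data); stands in for the cache

    def write(self, offset: int, data: bytes) -> None:
        # Acknowledged as soon as it is queued; the backing store is untouched.
        self.pending.append((offset, data))

    def destage(self, max_writes: int) -> int:
        """Apply up to max_writes of the oldest pending writes, oldest first."""
        done = 0
        while self.pending and done < max_writes:
            offset, data = self.pending.popleft()
            self.backing[offset] = data
            done += 1
        return done

    def crash_volatile(self) -> dict:
        """Drop everything still pending, as if the cache lived in RAM."""
        self.pending.clear()
        return dict(self.backing)


if __name__ == "__main__":
    cache = OrderedWritebackCache()
    cache.write(0, b"txn1")
    cache.write(4096, b"txn2")
    cache.write(0, b"txn3")
    cache.destage(2)               # only the two oldest writes reach the image
    print(cache.crash_volatile())  # txn3 is lost, txn1 and txn2 are intact
```

With a persistent cache device the pending queue would survive the crash and 
could be replayed, which is the difference between this and the RAM disk 
approach.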

> 
> I wonder if cache=unsafe does what I want, but it’s hard to test that
> assumption unless something catastrophic happens like it did with EIO…

I think this option just doesn't act on flushes, so it will likely lose 
data... probably not what you want.

> 
> Jan
> 
> > On 25 May 2015, at 19:58, Nick Fisk <[email protected]> wrote:
> >
> > Hi Jan,
> >
> > I share your frustrations with slow sync writes. I'm exporting RBDs via
> > iSCSI to ESX, which seems to do most operations in 64k sync IOs. You can
> > do a fio run and impress yourself with the numbers that you can get out of
> > the cluster, but this doesn't translate into what you can achieve when
> > using sync writes with a client.
> >
> > I too have been experimenting with flashcache/enhanceio with the goal of
> > using dual-port SAS SSDs to allow for HA iSCSI gateways. Currently I'm just
> > testing with a single iSCSI server and see a massive improvement. I'm
> > interested in the corruptions you have been experiencing on host crashes,
> > are you implying that you think flashcache is buffering writes before
> > submitting them to the SSD? When watching its behaviour using iostat it
> > looks like it submits everything in 4k IOs to the SSD, which to me looks
> > like it is not buffering.
> >
> > I did raise a topic a few months back asking about the possibility of
> > librbd supporting persistent caching to SSDs, which would allow writeback
> > caching regardless of whether the client requests a flush. Although there
> > was some interest in the idea, I didn't get the feeling it would be at the
> > top of anyone's priorities.
> >
> > Nick
> >
> >> -----Original Message-----
> >> From: ceph-users [mailto:[email protected]] On Behalf
> >> Of Jan Schermer
> >> Sent: 25 May 2015 09:59
> >> To: [email protected]
> >> Subject: [ceph-users] Synchronous writes - tuning and some thoughts
> >> about them?
> >>
> >> Hi,
> >> I have a full-SSD cluster on my hands, currently running Dumpling,
> >> with plans to upgrade soon, and OpenStack with RBD on top of that.
> >> While I am overall quite happy with the performance (scales well
> >> across clients), there is one area where it really fails badly - big
> >> database workloads.
> >>
> >> Typically, what a well-behaved database does is commit to disk every
> >> transaction before confirming it, so on a “typical” cluster with a
> >> write latency of 5ms (with SSD journal) the maximum number of
> >> transactions per second for a single client is 200 (likely more like 100
> >> depending on the filesystem).
> >> Now, that’s not _too_ bad when running hundreds of small databases,
> >> but it’s nowhere near the required performance to substitute for an
> >> existing SAN or even just a simple RAID array with writeback cache.
> >>
> >> First hope was that enabling RBD cache will help - but it really
> >> doesn’t because all the flushes (O_DIRECT writes) end on the drives
> >> and not in the cache. Disabling barriers in the client helps, but
> >> that makes it not crash consistent (unless one uses ext4 with
> >> journal_checksum etc., I am going to test that soon).
> >>
> >> Are there any plans to change this behaviour - i.e. make the cache a
> >> real writeback cache?
> >>
> >> I know there are good reasons not to do this, and I commend the
> >> developers for designing the cache this way, but real world workloads
> >> demand shortcuts from time to time - for example MySQL with its
> >> InnoDB engine has an option to only commit to disk every Nth
> >> transaction - and this is exactly the kind of thing I’m looking for.
> >> Not having every confirmed transaction/write on the disk is not a
> >> huge problem, having a b0rked filesystem is, so this should be safe
> >> as long as I/O order is preserved. Sadly, my database is not an
> >> InnoDB where I can tune something, but an enterprise behemoth that
> >> traditionally runs on FC arrays, it has no parallelism (that I could
> >> find), and always uses O_DIRECT for txlog.
> >>
> >> (For the record - while the array is able to swallow 30K IOps for a
> >> minute, once the cache is full it slows to ~3 IOps, while CEPH
> >> happily gives the same 200 IOps forever, bottom line is you always
> >> need more disks or more cache, and your workload should always be able
> >> to run without the cache anyway - even enterprise arrays fail, and
> >> write cache is not always available, contrary to popular belief).
> >>
> >> Is there some option that we could use right now to turn on a true
> >> writeback caching? Losing a few transactions is fine as long as ordering
> >> is preserved.
> >> I was thinking “cache=unsafe” but I have no idea whether I/O order is
> >> preserved with that.
> >> I already mentioned turning off barriers, which could be safe in some
> >> setups but needs testing.
> >> Upgrading from Dumpling will probably help with scaling, but will it
> >> help write latency? I would need to get from 5ms/write to <1ms/write.
> >> I investigated guest-side caching (enhanceio/flashcache) but that
> >> fails really bad when the guest or host crashes - lots of corruption.
> >> EnhanceIO in particular looked very nice and claims to respect
> >> barriers… not in my experience, though.
> >>
> >> It might seem that what I want is evil, and it really is if you’re
> >> running a banking database, but for most people this is exactly what
> >> is missing to make their workloads run without having some sort of
> >> 80s SAN system in their datacentre, I think everyone here would
> >> appreciate that :-)
> >>
> >> Thanks
> >>
> >> Jan
> >> _______________________________________________
> >> ceph-users mailing list
> >> [email protected]
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com