> -----Original Message-----
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Jan Schermer
> Sent: 18 August 2015 17:13
> To: Nick Fisk <n...@fisk.me.uk>
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
> 
> 
> > On 18 Aug 2015, at 16:44, Nick Fisk <n...@fisk.me.uk> wrote:
> >
> >> -----Original Message-----
> >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> >> Of Mark Nelson
> >> Sent: 18 August 2015 14:51
> >> To: Nick Fisk <n...@fisk.me.uk>; 'Jan Schermer' <j...@schermer.cz>
> >> Cc: ceph-users@lists.ceph.com
> >> Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
> >>
> >>
> >>
> >> On 08/18/2015 06:47 AM, Nick Fisk wrote:
> >>> Just to chime in, I gave dmcache a limited test but its lack of proper
> >>> writeback cache ruled it out for me. It only performs write-back
> >>> caching on blocks already on the SSD, whereas I need something that
> >>> works like a battery-backed RAID controller caching all writes.
> >>>
> >>> It's amazing, the 100x performance increase you get with RBDs doing
> >>> sync writes when you give them something like just 1GB of write-back
> >>> cache with flashcache.
> >>
> >> For your use case, is it ok that data may live on the flashcache for
> >> some amount of time before making it to Ceph to be replicated?  We've
> >> wondered internally whether this kind of trade-off is acceptable to
> >> customers should the flashcache SSD fail.
> >
> > Yes, I agree, it's not ideal. But I believe it's the only way to get the
> > performance required for some workloads that need write latencies of <1ms.
> >
> > I'm still in testing at the moment with the testing kernel that includes
> > blk-mq fixes for large queue depths and max IO sizes. But if we decide to
> > put it into production, it would be using 2x dual-port SAS SSDs in RAID1
> > across two servers for HA. As we are currently using iSCSI from these two
> > servers, there is no real loss of availability by doing this. Generally I
> > think as long as you build this around the fault domains of the
> > application you are caching, it shouldn't impact things too much.
> >
> > I guess for people using OpenStack and other direct RBD interfaces it may
> > not be such an attractive option. I've been thinking that maybe Ceph needs
> > an additional daemon with very low overheads, run on SSDs, to provide
> > shared persistent cache devices for librbd. There's still a trade-off,
> > maybe not as much as with flashcache, but for some workloads like
> > databases, many people may decide that it's worth it. Of course I realise
> > this would be a lot of work and everyone is really busy, but in terms of
> > performance gained it would most likely have a dramatic effect in making
> > Ceph look comparable to other solutions like VSAN or ScaleIO when it comes
> > to high-IOPS/low-latency workloads.
> >
> 
> Additional daemon that is persistent how? Isn't that what the journal does
> already, just too slowly?

The journal is part of an OSD and is speed-restricted by a lot of the
functionality that Ceph has to provide. I was thinking more of a very
lightweight "service" that acts as an interface between an SSD and librbd and
is focused on speed. For something like a standalone SQL server it might run
on the SQL server with a local SSD, but in other scenarios you might have this
"service" remote, wherever the SSDs are installed. HA for the SSD could be
provided by RAID + dual-port SAS, or maybe some sort of lightweight
replication could be built into the service.
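
Very roughly, I'm imagining something like the sketch below - purely to
illustrate the idea, not a design. It assumes the standard python-rados /
python-rbd bindings; the class name, log path, pool and image names are made
up, and crash recovery, HA and error handling are all hand-waved away:

    # Hypothetical sketch of a lightweight write-back layer in front of librbd:
    # persist each write to a local SSD-backed log, ack immediately, and flush
    # to the real RBD image in the background.
    import os
    import queue
    import threading

    import rados
    import rbd

    class WriteBackCache:
        def __init__(self, pool, image_name, log_path='/mnt/ssd/writeback.log'):
            self.cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
            self.cluster.connect()
            self.ioctx = self.cluster.open_ioctx(pool)
            self.image = rbd.Image(self.ioctx, image_name)
            self.log = open(log_path, 'ab')      # SSD-backed write-ahead log
            self.pending = queue.Queue()
            threading.Thread(target=self._flusher, daemon=True).start()

        def write(self, offset, data):
            # Persist to the local SSD and ack the caller - this is the only
            # latency the client sees.
            record = (offset.to_bytes(8, 'little')
                      + len(data).to_bytes(4, 'little') + data)
            self.log.write(record)
            self.log.flush()
            os.fsync(self.log.fileno())
            self.pending.put((offset, data))     # replicate into Ceph later

        def _flusher(self):
            # Background thread: drain pending writes into the real image,
            # where normal Ceph replication happens.
            while True:
                offset, data = self.pending.get()
                self.image.write(data, offset)

HA for the log itself would still have to come from the RAID/dual-port SAS
setup, or from some lightweight replication built into the service.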


This was just a random thought rather than something I have fully planned out.

> 
> I think the best (and easiest!) approach is to mimic what a monolithic SAN
> does
> 
> Currently
> 1) client issues blocking/atomic/sync IO
> 2) rbd client sends this IO to all OSDs
> 3) after all OSDs "process the IO", the IO is finished and considered 
> persistent
> 
> That has serious implications:
>       * every IO is processed separately, with not much coalescing
>       * the OSD processes add latency when processing this IO
>       * one OSD can be slow momentarily, IO backs up and the cluster stalls
> 
> Let me just select what "processing the IO" means with respect to my
> architecture and I can likely get a 100x improvement
> 
> Let me choose:
> 
> 1) WHERE the IO is persisted
> Do I really need all (e.g. 3) OSDs to persist the data or is quorum (2)
> sufficient?
> Not waiting for one slow OSD gives me at least some SLA for planned tasks
> like backfilling, scrubbing and deep-scrubbing. Hands up who can afford to
> leave deep-scrub enabled in production...

In my testing the difference between 2 and 3 replicas wasn't that much, since
once the primary OSD sends out the replica writes they happen more or less in
parallel.
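
As a back-of-the-envelope illustration (toy numbers only, not a benchmark):
the client effectively waits for the primary plus the slowest parallel sub-op,
so the third replica only costs the difference between the max of one and the
max of two samples. A quick Python sketch of that model:

    # Toy model of replicated write latency - the numbers are made up.
    # The primary sends replica writes out in parallel, so the client waits
    # for the primary plus the slowest sub-op.
    import random

    def write_latency_ms(replicas, primary_ms=0.5):
        sub_ops = [abs(random.gauss(1.0, 0.3)) for _ in range(replicas - 1)]
        return primary_ms + (max(sub_ops) if sub_ops else 0.0)

    for size in (2, 3):
        samples = [write_latency_ms(size) for _ in range(100000)]
        print("%d replicas: avg %.2f ms" % (size, sum(samples) / len(samples)))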

> 
> 2) WHEN the IO is persisted
> Do I really need all OSDs to flush the data to disk?
> If all the nodes are in the same cabinet and on the same UPS then this makes
> sense.
> But my nodes are actually in different buildings ~10km apart. The chances of
> power failing simultaneously, N+1 UPSes failing simultaneously, diesels 
> failing
> simultaneously... When nukes start falling and this happens then I'll start
> looking for backups.
> Even if your nodes are in one datacentre, there are likely redundant (2+)
> circuits.
> And even if you have just one cabinet, you can add 3x UPS in there and gain a
> nice speed boost.
> 
> So the IO could actually be pretty safe and happy once it gets to remote
> buffers on enough (quorum) nodes and waits for processing. It can be
> batched, it can be coalesced, it can be rewritten with subsequent updates...

Agreed, it would be nice if, once the primary OSD has written its data, it
returned an ACK to the client and was then responsible for ensuring the data
is written to the other OSDs later on. Is this what the Async Messenger does?
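
For reference, here is a minimal sketch of timing a single sync write as
things stand today (assuming the python-rados / python-rbd bindings; the pool
and image names are placeholders) - write() only returns once the write has
been acknowledged by the OSDs holding the object:

    # Minimal sketch: time one synchronous 4k write through librbd.
    # Today this call blocks until the OSDs have acknowledged the write;
    # the change discussed above would let it return earlier.
    import time

    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')            # pool name is a placeholder
    image = rbd.Image(ioctx, 'test-image')       # image name is a placeholder

    data = b'\0' * 4096
    start = time.time()
    image.write(data, 0)                         # blocks until acknowledged
    print("sync 4k write: %.2f ms" % ((time.time() - start) * 1000.0))

    image.close()
    ioctx.close()
    cluster.shutdown()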

> 
> 3)  WHAT amount of IO is stored
> Do I need to have the last transaction or can I tolerate 1 minute of missing
> data?
> Checkpoints, checksums on last transaction, rollback (journal already does
> this AFAIK)...
> 
> 4) I DON'T CARE mode :-)
> A qemu cache=unsafe equivalent, but set on an RBD volume/pool. Because
> sometimes you just need to crunch data without really storing it
> persistently - how are the CERN/Hadoop/Big Data guys approaching this?
> And you can't always disable flushing. Filesystems have "nobarrier"
> (usually), but if you need a block device for a raw database tablespace,
> you're pretty much SOL without lots of trickery.
> 
> 
> 1) is doable eventually.
> 
> 2) is doable almost immediately
>       a) just ACK the IO when you get it, let the client unblock on quorum
>       or
>       b) drop the journal, write all data asynchronously, let the filesystem
> handle consistency and let me tune dirty_writeback_centisecs to get the
> behaviour I want with respect to 3)
> 
> 4) simple to do, unusable for production (for most of us)
>       but flushing is expensive, so why flush just because some file
> metadata changed on a QA machine?
>       Dev & QA often create a higher load than production itself...
> 
> sorry, got carried away, again....
> 
> Jan
> 
> 
> >>
> >>>
> >>>
> >>>> -----Original Message-----
> >>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
> >>>> Behalf Of Jan Schermer
> >>>> Sent: 18 August 2015 12:44
> >>>> To: Mark Nelson <mnel...@redhat.com>
> >>>> Cc: ceph-users@lists.ceph.com
> >>>> Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
> >>>>
> >>>> I did not. Not sure why now - probably for the same reason I didn't
> >>>> extensively test bcache.
> >>>> I'm not a real fan of device mapper though, so if I had to choose
> >>>> I'd still go for bcache :-)
> >>>>
> >>>> Jan
> >>>>
> >>>>> On 18 Aug 2015, at 13:33, Mark Nelson <mnel...@redhat.com>
> wrote:
> >>>>>
> >>>>> Hi Jan,
> >>>>>
> >>>>> Out of curiosity did you ever try dm-cache?  I've been meaning to
> >>>>> give it a
> >>>> spin but haven't had the spare cycles.
> >>>>>
> >>>>> Mark
> >>>>>
> >>>>> On 08/18/2015 04:00 AM, Jan Schermer wrote:
> >>>>>> I already evaluated EnhanceIO in combination with CentOS 6 (and
> >>>>>> backported 3.10 and 4.0 kernel-lt if I remember correctly).
> >>>>>> It worked fine during benchmarks and stress tests, but once we ran
> >>>>>> DB2 on it, it panicked within minutes and took all the data with it
> >>>>>> (almost literally - files that weren't touched, like OS binaries,
> >>>>>> were b0rked and the filesystem was unsalvageable).
> >>>>>> If you disregard this warning - the performance gains weren't that
> >>>>>> great either, at least in a VM. It had problems when flushing to disk
> >>>>>> after reaching the dirty watermark, and the block size has some
> >>>>>> not-well-documented implications (not sure now, but I think it only
> >>>>>> cached IO _larger_ than the block size, so if your database keeps
> >>>>>> incrementing an XX-byte counter it will go straight to disk).
> >>>>>>
> >>>>>> Flashcache doesn't respect barriers (or does it now?) - if that's
> >>>>>> ok for you then go for it, it should be stable and I used it in the
> >>>>>> past in production without problems.
> >>>>>>
> >>>>>> bcache seemed to work fine, but I needed to
> >>>>>> a) use it for root
> >>>>>> b) disable and enable it on the fly (doh)
> >>>>>> c) make it non-persistent (flush it) before reboot - not sure if
> >>>>>> that was possible either.
> >>>>>> d) all that in a customer's VM, and that customer didn't have a
> >>>>>> strong technical background to be able to fiddle with it...
> >>>>>> So I haven't tested it heavily.
> >>>>>>
> >>>>>> Bcache should be the obvious choice if you are in control of the
> >>>>>> environment. At least you can cry on LKML's shoulder when you
> >>>>>> lose data :-)
> >>>>>>
> >>>>>> Jan
> >>>>>>
> >>>>>>
> >>>>>>> On 18 Aug 2015, at 01:49, Alex Gorbachev
> >>>>>>> <a...@iss-integration.com>
> >>>> wrote:
> >>>>>>>
> >>>>>>> What about https://github.com/Frontier314/EnhanceIO?  Last
> >>>>>>> commit
> >>>>>>> 2 months ago, but no external contributors :(
> >>>>>>>
> >>>>>>> The nice thing about EnhanceIO is there is no need to change
> >>>>>>> device name, unlike bcache, flashcache etc.
> >>>>>>>
> >>>>>>> Best regards,
> >>>>>>> Alex
> >>>>>>>
> >>>>>>> On Thu, Jul 23, 2015 at 11:02 AM, Daniel Gryniewicz
> >>>>>>> <d...@redhat.com>
> >>>> wrote:
> >>>>>>>> I did some (non-Ceph) work on these, and concluded that bcache
> >>>>>>>> was the best supported, most stable, and fastest.  This was ~1
> >>>>>>>> year ago, so take it with a grain of salt, but that's what I
> >>>>>>>> would recommend.
> >>>>>>>>
> >>>>>>>> Daniel
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> ________________________________
> >>>>>>>> From: "Dominik Zalewski" <dzalew...@optlink.net>
> >>>>>>>> To: "German Anders" <gand...@despegar.com>
> >>>>>>>> Cc: "ceph-users" <ceph-users@lists.ceph.com>
> >>>>>>>> Sent: Wednesday, July 1, 2015 5:28:10 PM
> >>>>>>>> Subject: Re: [ceph-users] any recommendation of using
> EnhanceIO?
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> I asked the same question in the last week or so (just search the
> >>>>>>>> mailing list archives for EnhanceIO :) and got some interesting
> >>>>>>>> answers.
> >>>>>>>>
> >>>>>>>> Looks like the project is pretty much dead since it was bought
> >>>>>>>> out by
> >>>> HGST.
> >>>>>>>> Even their website has some broken links in regards to
> >>>>>>>> EnhanceIO
> >>>>>>>>
> >>>>>>>> I'm keen to try flashcache or bcache (it's been in the mainline
> >>>>>>>> kernel for some time)
> >>>>>>>>
> >>>>>>>> Dominik
> >>>>>>>>
> >>>>>>>> On 1 Jul 2015, at 21:13, German Anders
> <gand...@despegar.com>
> >>>> wrote:
> >>>>>>>>
> >>>>>>>> Hi cephers,
> >>>>>>>>
> >>>>>>>>   Is anyone out there who has implemented EnhanceIO in a production
> >>>>>>>> environment? Any recommendations? Any perf output to share showing
> >>>>>>>> the difference between using it and not?
> >>>>>>>>
> >>>>>>>> Thanks in advance,
> >>>>>>>>
> >>>>>>>> German




_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
