> -----Original Message-----
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Mark Nelson
> Sent: 27 May 2015 16:00
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Synchronous writes - tuning and some thoughts
> about them?
> 
> On 05/27/2015 09:33 AM, Jan Schermer wrote:
> > Hi Nick,
> > responses inline, again ;-)
> >
> > Thanks
> >
> > Jan
> >
> >> On 27 May 2015, at 12:29, Nick Fisk <n...@fisk.me.uk> wrote:
> >>
> >> Hi Jan,
> >>
> >> Responses inline below
> >>
> >>> -----Original Message-----
> >>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
> >>> Behalf Of Jan Schermer
> >>> Sent: 25 May 2015 21:14
> >>> To: Nick Fisk
> >>> Cc: ceph-users@lists.ceph.com
> >>> Subject: Re: [ceph-users] Synchronous writes - tuning and some
> >>> thoughts about them?
> >>>
> >>> Hi Nick,
> >>>
> >>> flashcache doesn’t support barriers, so I haven’t even considered
> >>> it. I used it a few years ago to speed up some workloads out of
> >>> curiosity and it worked well, but I can’t use it to cache this kind
> >>> of workload.
> >>>
> >>> EnhanceIO passed my initial testing (although the documentation is
> >>> very sketchy and the project is abandoned AFAIK), and is supposed to
> >>> respect barriers/flushes. I was only interested in a “volatile
> >>> cache” scenario - create a ramdisk in the guest (for example 1GB)
> >>> and use it to cache the virtual block device (and of course flush
> >>> and remove it before rebooting). All worked pretty well during my
> >>> testing with fio & stuff until I ran the actual workload - in my
> >>> case a DB2 9.7 database. It took just minutes for the kernel to
> >>> panic (I can share a screenshot if you’d like). So it was not a host
> >>> failure but a guest failure and it managed to fail on two fronts -
> >>> stability and crash consistency - at the same time. The filesystem
> >>> was completely broken afterwards - while it could be mounted
> >>> “cleanly” (journal appeared consistent), there was massive damage to
> >>> the files. I expected the open files to be zeroed or missing or
> >>> damaged, but it did very random damage all over the place including
> >>> binaries in /bin, manpage files and so on - things that nobody was
> >>> even touching. Scary.
> >>
> >> I see, so just to confirm you don't want to use a caching solution
> >> with an SSD, just a ram disk? I think that’s where our approaches
> >> differed and I can understand why you are probably having problems
> >> when the OS crashes or suffers power loss. I was going the SSD route,
> >> with something like:-
> >>
> >
> > This actually proves that EnhanceIO doesn’t really respect barriers, at
> > least not when flushing blocks to the underlying device.
> > To be fair, maybe using a (mirrored!) SSD makes it crash-consistent, maybe
> > it has an internal journal and just replays whatever is in cache - I will
> > not read the source code to confirm that because to me that’s clearly not
> > what I need.
> 
> FWIW, I think both dm-cache and bcache properly respect barriers, though I
> haven't read through the source.
> 
> >
> >>
> >> http://www.storagereview.com/hgst_ultrastar_ssd800mm_enterprise_ssd_review
> >>
> >> On my iSCSI head nodes, but if you are exporting RBD's to lots of
> >> different servers I guess this wouldn't work quite the same.
> >>
> >
> > Exactly. If you want to maintain elasticity, want to be able to migrate
> > instances freely, then using any local storage is a no-go.
> >
> >> I don't really see a solution that could work for you without using
> >> SSD's for the cache. You seem to be suffering from slow sync writes and
> >> want to cache them in a volatile medium, but by their very nature sync
> >> writes are meant to be safe once the write confirmation arrives. I guess
> >> in any caching solution barriers go some way to help guard against data
> >> corruption, but if properly implemented they will probably also slow the
> >> speed down to what you can achieve with just RBD caching. Much like
> >> hardware RAID controllers, they only enable writeback cache if they can
> >> guarantee data security, either by a functioning battery backup or flash
> >> device.
> >>
> >
> > You are right. Sync writes and barriers are supposed to be flushed to
> > physical medium when returning (though in practice lots of RAID
> > controllers and _all_ arrays will lie about that, slightly breaking the
> > spec but still being safe if you don’t let the battery die).
> > I don’t want to lose crash consistency, but I don’t need to have the
> > latest completed transaction flushed to the disk - I don’t care if a
> > power outage wipes the last 1 minute of records from the database even
> > though they were “committed” by the database and should thus be flushed
> > to disks, and I don’t think too many people care either as long as it’s
> > fast.
> > Of course, this won’t work for everyone and in that respect the current
> > rbd cache behaviour is 100% correct.

Another potential option which honours barriers

http://www.lessfs.com/wordpress/?p=699

But I still don't see how you are going to differentiate between when you want 
to flush and when you don't. I might still be misunderstanding what you want to 
achieve, but it seems you want to honour barriers only once every minute, and 
the rest of the time writes are just stored in RAM?
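If I've got it right, the semantics would be something like this toy sketch 
(plain Python, purely illustrative - not librbd or any real cache, and 
backend_write/backend_flush are just placeholders for whatever talks to the 
actual device): acknowledge the fsync immediately, keep dirty writes in RAM in 
order, and push them out on a timer.

import threading
import time

class LazyFlushCache:
    """Toy write-back cache: acks flushes immediately, but only pushes
    dirty data to the backing store every `interval` seconds, in the order
    it was written, so the backing device always reflects some consistent
    point in the recent past."""

    def __init__(self, backend_write, backend_flush, interval=60.0):
        self.backend_write = backend_write   # placeholder, e.g. rbd.Image.write
        self.backend_flush = backend_flush   # placeholder, e.g. rbd.Image.flush
        self.interval = interval
        self.dirty = []                      # ordered list of (offset, data)
        self.lock = threading.Lock()
        t = threading.Thread(target=self._flusher)
        t.daemon = True
        t.start()

    def write(self, offset, data):
        with self.lock:
            self.dirty.append((offset, data))

    def fsync(self):
        # A real sync write/barrier would block here until the backend
        # acknowledged - this is exactly the step being skipped.
        return 0                             # report success immediately

    def _flusher(self):
        while True:
            time.sleep(self.interval)
            with self.lock:
                batch, self.dirty = self.dirty, []
            for offset, data in batch:       # replay in original order
                self.backend_write(data, offset)
            if batch:
                self.backend_flush()

So on a crash you lose up to `interval` seconds of writes, but ordering is 
preserved - which I think is what you are describing.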


> > And of course it won’t solve all problems - if you have an underlying
> > device that can do 200 IOPS but your workload needs 300 IOPS at all
> > times, then caching the writes is a bit futile - it may help for a few
> > seconds and then it gets back to 200 IOPS at best. It might, however,
> > help if you rewrite the same blocks again and again, incrementing a
> > counter or updating one set of data - there it will just update the dirty
> > block in cache and flush it from time to time.
> > It can also turn some random IO into sequential IO, coalescing adjacent
> > blocks into one re/write or journaling it in some way (the CEPH journal
> > does exactly this, I think).

The coalescing also has a massive benefit in being able to turn single threaded 
writes into highly parallel writes. I've seen Flashcache generate queue depths 
in the hundreds from a client doing single threaded writes.
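For what it's worth, the mechanics are roughly like this toy sketch (plain 
Python, purely illustrative - not Flashcache's actual code): rewrites of the 
same block just overwrite the dirty entry, adjacent dirty blocks get merged 
into bigger extents, and the merged extents are issued in parallel, which is 
where the deep queue comes from.

from concurrent.futures import ThreadPoolExecutor

BLOCK = 4096

def coalesce(dirty):
    # `dirty` maps block number -> 4K buffer; a rewrite simply replaces
    # the old buffer, absorbing the "increment a counter" workload.
    # Returns a list of (offset, data) extents with adjacent blocks merged.
    extents = []
    for blk in sorted(dirty):
        data = dirty[blk]
        if extents and extents[-1][0] + len(extents[-1][1]) == blk * BLOCK:
            off, buf = extents.pop()
            extents.append((off, buf + data))
        else:
            extents.append((blk * BLOCK, data))
    return extents

def flush_parallel(extents, backend_write, workers=64):
    # One single-threaded client ends up as many outstanding writes
    # against the backing device.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for off, buf in extents:
            pool.submit(backend_write, buf, off)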

> >
> >
> >>
> >>>
> >>> I don’t really understand your question about flashcache - do you
> >>> run it in writeback mode? It’s been years since I used it so I won’t
> >>> be much help here
> >>> - I disregarded it as unsafe right away because of barriers and
> >>> wouldn’t use it in production.
> >>>
> >>
> >> What I mean is that for every IO that passed through flashcache, I see
> >> it write to the SSD with no delay/buffering. So from a kernel
> >> panic/power-loss situation, as long as the SSD has power-loss caps and
> >> the flashcache device is assembled correctly before mounting, I don't
> >> see a way for data to be lost. Although I haven't done a lot of testing
> >> around this yet, so I could be missing something.
> >>
> >
> > That would be flashcache in writethrough or writearound mode - in
> > writeback mode it will always end up in cache and only after some time
> > get flushed to disk (how fast it will get flushed can be tuned, but it
> > will always be async to the originating IO by nature).
> >
> >>> I don’t think a persistent cache is something to do right now, it
> >>> would be overly complex to implement, it would limit migration, and
> >>> it can be done on the guest side with (for example) bcache if really
> >>> needed - you can always expose a local LVM volume to the guest and
> >>> use it for caching (and that’s something I might end up doing) with
> >>> mostly the same effect.
> >>> For most people (and that’s my educated guess) the only needed
> >>> features are that it needs to be fast(-er) and it needs to come up
> >>> again after a crash without recovering from backup - that’s something
> >>> that could be just a slight modification to the existing RBD cache -
> >>> just don’t flush it on every fsync() but maintain ordering - and
> >>> it’s done? I imagine some ordering is there already, it must be
> >>> flushed when the guest is migrated, and it’s production-grade and
> >>> not just some hackish attempt. It just doesn’t really cache the
> >>> stuff that matters most in my scenario…
> >>
> >> My initial idea was just to be able to specify a block device to use
> >> for writeback caching in librbd. This could either be a local block
> >> device (dual port sas for failover/cluster) or an iSCSI device if it
> >> needs to be shared around a larger cluster of hypervisors...etc
> >>
> >> Ideally though this would all be managed through Ceph with some sort of
> >> "OSD-lite" device which is optimized for sync writes but misses out on a
> >> lot of the distributed functionality of a full-fat OSD. This way you
> >> could create a writeback pool and then just specify it in the librbd
> >> config.
> >>
> >
> > That’s the same as giving a local device to the guest for caching, just
> > one level up. The elasticity/migration problem is still there, and you
> > need to invent another mechanism to flush the cache when something
> > happens - all this work is done in RBD cache already, it just isn’t
> > non-volatile.
> >
> > CEPH tiering will one day accomplish this to an extent, but even if you
> > have a super-fast local network with super-fast SSDs, the trip time will
> > be the limiting factor here - writing over Fibre Channel to a SAN cache
> > is very, very fast compared to what the write latencies are right now,
> > even for SSD-backed journals. I certainly see this happening one day,
> > but not anytime soon.
> >
> > I’d love to hear some thoughts from CEPH developers on this.
> > I think that right now the only place where this can be cached
> > comparably fast is RBD cache (it just needs crash consistency and true
> > writeback mode).
> > Another way would be to return the write as completed when the IO
> > reaches all OSD buffers (DRBD does this in one mode) - this is safe as
> > long as you don’t lose power on all OSDs holding the replicas at the
> > same time, and in the worst case you only lose the most recent
> > transactions.
> > The optimal solution is to vastly decrease journal write latency; I
> > guess this would require the most work.
> >
> >
> >>>
> >>> I wonder if cache=unsafe does what I want, but it’s hard to test
> >>> that assumption unless something catastrophic happens like it did
> >>> with EIO…
> >>
> >> I think this option just doesn't act on flushes so will likely lose
> >> data....probably not what you want
> >>
> >
> > This is exactly what I want. But there’s a difference between not
> > flushing (you lose 1 minute of writes but everything is consistent)
> > and reordering the IO (you end up with a filesystem journal oblivious to
> > the changes that might or might not have really happened).
> 
> 
> A couple of other things.  Have you tried:
> 
> - disabling authentication
> - disabling in-memory logging
> 
> Another area that we probably want to look at is CRC. You can only disable
> this in the messenger, not the journal, and disabling it isn't a good general
> solution, but it would be worth investigating.
> 
> Disabling authentication, for instance, can definitely have an effect on sync
> performance. See Page 6 and Page 9 here:
> 
> http://nhm.ceph.com/Ceph_SSD_OSD_Performance.pdf
> 
> Mark

Hi Mark, I think the real problem is that even tuning Ceph to the max it is 
still potentially 100x slower than a hardware RAID card for doing these very 
important sync writes. Especially for DB's that have been designed to rely on 
the fact that they can submit a long chain of very small IO's: without some 
sort of cache sitting at the front of the whole Ceph infrastructure (journals 
and cache tiering are too far back), Ceph just doesn't provide the required 
latency. I know it would be quite a large piece of work, but implementing some 
sort of distributed cache with a very low overhead that could plumb directly 
into librbd would dramatically improve performance, especially for a lot of 
enterprise workloads.
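To put rough numbers on that, something like the sketch below is how I'd 
measure the per-IO cost of sync writes from the client side, using the librbd 
Python bindings. The pool and image names are made up, and the debug overrides 
are just the usual "turn logging off" suspects along the lines Mark mentions 
above; disabling cephx would additionally need to be done cluster-wide in 
ceph.conf rather than here.

#!/usr/bin/env python
# Rough sketch: client-visible latency of small synchronous writes via librbd.
import time
import rados
import rbd

conf = {
    'debug_ms': '0/0',        # no messenger logging (incl. in-memory logs)
    'debug_rbd': '0/0',
    'debug_objecter': '0/0',
}

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf', conf=conf)
cluster.connect()
ioctx = cluster.open_ioctx('rbd')          # assumed pool name
image = rbd.Image(ioctx, 'latency-test')   # assumed pre-created image >= 4 MB

data = b'\0' * 4096
latencies = []
for i in range(1000):
    t0 = time.time()
    image.write(data, i * 4096)            # small write...
    image.flush()                          # ...forced out like an fsync
    latencies.append(time.time() - t0)

latencies.sort()
print("median sync write latency: %.3f ms" % (latencies[len(latencies) // 2] * 1e3))
print("99th percentile:           %.3f ms" % (latencies[int(len(latencies) * 0.99)] * 1e3))

image.close()
ioctx.close()
cluster.shutdown()

Comparing those numbers against the same loop on a battery-backed RAID LUN is 
what makes the latency gap so obvious for these workloads.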




_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
